A spatial study of InPost's deployment strategy of parcel pick-up points in relation to competitors

Authors: Michał Woźniak (385190) & Michał Wrzesiński (385197)

1. Introduction

The e-commerce market has been growing rapidly over the past few years, and the Covid-19 pandemic has further enhanced this effect by making irreversible changes in consumer habits. Currently, one of the fastest growing e-commerce markets in Europe is Poland. Companies such as: Allegro, OLX, Amazon, Aliexpress have an average daily turnover of several million baskets. PWC estimated that in 2026 the gross value of the Polish e-commerce market will be at the level of 162 billion PLN.

In order to handle such volume of parcels, well-functioning logistics companies are required to deliver the ordered parcel within a short time from the order. In Poland, there are several international logistic companies, e.g. DHL, DPD; Poczta Polska is an important local player; however, InPost enjoys the greatest recognition among customers. Its success has two sources: firstly, InPost has developed a network of parcel machines (Paczkomat) which have revolutionized the way we receive parcels; secondly, InPost has perfectly operationalized logistics thus is able to deliver parcels in a very short time.

One of the obvious elements of InPost's successful logistics is an appropriate deployment strategy for parcel pick-up points. This point determines how convenient the InPost service is for the end user. As we suppose, the models that determine the choice of a spatial point to place a Paczkomat are not ideal because they do not naturally take into account constraints such as the inability to lease a given point in space. Thus, departments responsible for the expansion of new pick-up points can often make decisions about placing a parcel machine in an intuitive way (for instance close to the shopping mall), without statistical basis (they probably have their own targets to achieve - to receive bonuses and they do not care about evaluation of their choice in the short/mid-term). In addition, recently, more and more competitors for InPost have appeared (adopting the same model of delivery into parcel machine), such as Ruch or even Poczta Polska. It seems to be logical that they possibly copy the deployment choices of InPost. Therefore, it seems very interesting to analyze how InPost's pick-up points deployment situation looks like compared to the competition. In addition, it is worth verifying whether the deployment determinants of InPost's Paczkomats are consistent with the literature. We are referring here to the Morganti et al. (2014) publication which suggests such indicators of pick-up point deployment: demographic indicators (population density, employment rate, computer ownership, Internet access), centers and nodes for city users ("parameters related to end-consumers' mobility and accessibility to socio-economic activities, in particular end-consumers' use of both public transport and private cars, and the density of retail outlets and commercial services, business and employment sites, cultural and leisure centers and public transportation nodes"), parcels flow within the network (mostly related to the transport system and users preferences).

The purpose of this project is to examine the deployment strategy of InPost parcel pick-up points compared to the competition in multivariate environment. Furthermore, we want to test whether the deployment determinants (herein referred to as control variables) indicated by the literature and our modeling intuition are relevant to InPost's strategy.

We state following research questions:

  1. Does InPost have parcel pick-up points deployed in line with the competition?
  2. Do control variables affect the number of InPost pickup points in a given area?
  3. Are spatial effects significant in multivariate econometric models?

To answer the above research questions, we conducted a full econometric analysis based primarily on spatial statistics and spatial models. In order to simplify the computational complexity of the task, we focused only on two cities in Poland: Warsaw and Cracow. These are two key cities from the point of view of InPost (Warsaw - the capital of Poland, Krakow - where InPost was established). Depending on the modeling approach we used different levels of data aggregation for Spatial dependence models - grid 1km x 1km, for Spatial drift models - point data. We are testing those approaches because in this problem, both Spatial Autocorrelation and Spatial Drift seem to be intuitive phenomena.

We present following table of content for this research:

  1. Introduction
  2. Data gathering
  3. Dataset construction
  4. Spatial visualizations
  5. Spatial statistical explanatory analysis
  6. Non-spatial statistical explanatory analysis
  7. Spatial dependence models
  8. Spatial drift models
  9. Conclusions

2. Data gathering

This part is devoted to data collection process. As the output we obtain raw data which will be transformed to the final dataset in the 3. Dataset construction section.

Generally, we devided data into 4 categories:

  • pick-up points data
  • spatial shapes data
  • demographic data
  • points of interest data.

Pick-up points data comes from websites like: Bliskapaczka.pl and DHL.

Spatial shapes data comes from GUGIK (head office of geodesy and cartography in Poland).

Demographic data are taken from the Inspire repository and it represents indicators for 1km2 grids in Poland.

Finally, we obtained points of interest data from OSM repository. This is amazing site which store snapshots of the OSM in shape files!!!

Import dependencies

In [ ]:
from google_drive_downloader import GoogleDriveDownloader
import requests
import json
import numpy as np
import pandas as pd

%config Completer.use_jedi = False

Utilities

We define some utilities for code reproducibility.

In [ ]:
def download_gd_data_from_dict(dictionary: dict):
    '''
    Download, save and unzip data from Google Drive
    '''
    for i,j in dictionary.items():
        GoogleDriveDownloader.download_file_from_google_drive(file_id=j,
                                            dest_path=f"../datasets/raw_data/{i}/{i}.zip",
                                            unzip=True,
                                            showsize=True,
                                            overwrite=False)

def download_json_data_from_url(name:str, url: str):
    '''
    Download and save data from JSON API outputs
    '''
    response = requests.get(url).text
    df = pd.DataFrame(json.loads(response))
    df.to_csv(f"../datasets/raw_data/{name}.csv")

Download data collected and stored on our Google Drive

We decided to download data from GUIGK, OSM and Inspire (using links attached in the introduction to this stage of study) and store it on our academic Google Drive to obtain reproducibility in any time. Thanks to our functional utilities we can just pass direct link to the file and then download, store and unzip files with the data! Does data are not stored in our remote git repository due to their size, but thanks to Google Drive they are available for anyone!

You can also inspect the file via Browser just combine: https://drive.google.com/file/d/ + file_id, for instace: https://drive.google.com/file/d/1BZCmADIZhJuf1_Jh-f6D8vSpI8p5-2wd

Source: GUIGK

In [ ]:
gugik = {'guigk_voi':'1BZCmADIZhJuf1_Jh-f6D8vSpI8p5-2wd',
        'guigk_pov':'1wX99dmNUbiEKYKh-qAfxDipT9oC6DLzE',
        'guigk_com':'1URjb9NM6Fm_qES5kC4QPPXZGERzarUIa'}

download_gd_data_from_dict(gugik)

Source: OSM

In [ ]:
osm = {'osm_mazowieckie':'195E_n9JlgavFWp4mbaOCHAYKFWziBkc0',
        'osm_malopolskie':'1KG6uPhCZ-jKDgEpBU46WKHXVG_Mc-dBS'}

download_gd_data_from_dict(osm)

Source: Inspire

In [ ]:
inspire = {"inspire":"1avnBMziIn9uLetSbucMrZlZadhnSUvPE"}
download_gd_data_from_dict(inspire)

Scrape pick-up points data from websites

We collect that about pick-up points from two website or to be more precise from their APIs. It is the smartest way to gather this data in seconds!

Source: Bliskapaczka.pl

In [ ]:
url = 'https://pos.bliskapaczka.pl/api/v1/pos?fields=operator%2Ccode%2Clatitude%2Clongitude%2Cbrand%2CbrandPretty%2CoperatorPretty%2Ccod%2Cavailable%2C+city%2C+street&operators=RUCH%2CINPOST%2CPOCZTA%2CDPD%2CUPS%2CFEDEX'
download_json_data_from_url("bliska_paczka", url)

Source: DHL

In [ ]:
url = 'https://parcelshop.dhl.pl/index.php/points?type=lm&country=pl&ptype=parcelShop&hours_from=10&hours_to=16&week_days_PON=T&week_days_WT=T&week_days_SR=T&week_days_CZW=T&week_days_PT=T&week_days_SOB=N&week_days_NIEDZ=N&options_pickup_cod=N&show_on_map_parcelshop=T&show_on_map_parcelstation=T&show_on_map_pok=T&tab=pickup'
download_json_data_from_url("dhl", url)

3. Dataset construction

Import dependencies

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely import geometry
from pyproj import CRS
import geog
import shapely
from shapely.geometry import Point
import shapely.wkt
import matplotlib.pyplot as plt
import gc

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
%config Completer.use_jedi = False

import warnings
warnings.filterwarnings("ignore")

In this part of the study, we merge all the data into complete datasets. What is worth mentioning is that we collect data for 2 cities: Warszawa and Kraków, both for 1km2 grids and 500m buffers around Inpost points (only for GWR). In the case of merging for grids we took an assumption about intersection of grid with poviat (it was enough for grids to intersect with poviat boundaries). Additionally, we count other data for each grid. In turn, in the case of merging for buffers, we also count data for each buffer and also we use data from Inspire (if more than one grid intersects with buffer we take avarage value from those grids).

Parcel data

In this section, we merge the files for parcel points. Finally, we get a file that contains geolocation data for all available pickup points.

In [2]:
bliska_paczka = pd.read_csv("../datasets/raw_data/bliska_paczka.csv", index_col=0)
bliska_paczka = bliska_paczka.loc[bliska_paczka.available == True, ("brand", "operator", "city", "street", "longitude", "latitude")]
bliska_paczka.brand = bliska_paczka.brand.str.lower()
bliska_paczka.operator = bliska_paczka.operator.str.lower()
bliska_paczka.city = bliska_paczka.city.str.lower()
bliska_paczka.street = bliska_paczka.street.str.lower()
bliska_paczka = bliska_paczka.loc[:, ("brand", "operator", "longitude", "latitude")]
In [3]:
dhl = pd.read_csv("../datasets/raw_data/dhl.csv", index_col=0)
dhl.P_TYPE = dhl.P_TYPE.str.lower() 
dhl = dhl.drop(columns=["ID"])
dhl.columns = ["brand", "latitude", "longitude"]
dhl["operator"] = "dhl"
dhl = dhl[["brand", "operator", "longitude", "latitude"]]
In [4]:
df = pd.concat([bliska_paczka, dhl], axis=0)
In [6]:
df = df.drop(columns=["brand"])
df = gpd.GeoDataFrame(df, geometry=[geometry.Point(xy) for xy in zip(df.longitude, df.latitude)], crs=CRS("epsg:4258"))
df = df.drop(columns=["latitude", "longitude"])
In [9]:
df.to_csv('../datasets/preprocessed_data/pickup_points_by_operator.csv', index = False)

GUGIK and Inspire data

In the next section, we load data for GUGIK and Inspire. GUGIK data is necessary to obtain polygons for 2 poviats (Warszawa and Kraków). In turn, Inspire data provides us with demographic variables that define the population, both total and distinguished by gender and age for 1 km2 grids. We merge both of the above data sets by location, i.e. grids within and intersecting with poviats remain in our data set. Similarly, we add data for parcels of points according to their presence within poviats.

In [10]:
pov = gpd.read_file("../datasets/raw_data/guigk_pov/Powiaty.shx", encoding='utf-8')
pov = pov.loc[pov.JPT_NAZWA_.isin(["powiat Warszawa", "powiat Kraków"])==True, ("JPT_NAZWA_", "geometry")]
pov = pov.to_crs("epsg:4258")

grids = gpd.read_file("../datasets/raw_data/inspire/PD_STAT_GRID_CELL_2011.shp", encoding='utf-8')
grids = grids[['TOT', 'TOT_0_14', 'TOT_15_64', 'TOT_65__', 'TOT_MALE', 'TOT_FEM',
       'MALE_0_14', 'MALE_15_64', 'MALE_65__', 'FEM_0_14', 'FEM_15_64',
       'FEM_65__', 'FEM_RATIO', 'geometry']]
grids = grids.to_crs("epsg:4258")

pov_grids = gpd.sjoin(grids, pov, how="inner", op="intersects")
pov_grids = pov_grids.drop(columns=["index_right"])
pov_grids = pov_grids.reset_index()
pov_grids = pov_grids.rename(columns={"index":"grid_index"})
In [12]:
pov_grids_sliced = pov_grids[["grid_index", "geometry"]].copy() 
df_pov_grids_sliced = gpd.sjoin(df, pov_grids_sliced, how="right", op="within")
df_pov_grids_sliced = df_pov_grids_sliced.drop(columns=["index_left"])
df_pov_grids_sliced = df_pov_grids_sliced.groupby(["operator", "grid_index"], as_index=False).count()
df_pov_grids_sliced = df_pov_grids_sliced.pivot(index='grid_index', columns='operator').fillna(0).reset_index()
df_pov_grids_sliced.columns = df_pov_grids_sliced.columns.droplevel()
df_pov_grids_sliced.columns = ['grid_index', 'dhl', 'dpd', 'fedex', 'inpost', 'poczta', 'ruch', 'ups']
In [13]:
df_pov_grids = pd.merge(pov_grids, df_pov_grids_sliced, how="left", on="grid_index")
df_pov_grids.columns = df_pov_grids.columns.str.lower()
df_pov_grids = df_pov_grids.fillna(0)
df_pov_grids = df_pov_grids[['grid_index','geometry', 'jpt_nazwa_', 'dhl', 'dpd', 'fedex', 'inpost', 'poczta', 'ruch',
                             'ups', 'tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male',
                            'tot_fem', 'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14',
                            'fem_15_64', 'fem_65__', 'fem_ratio']]
In [17]:
del bliska_paczka, dhl, pov_grids, pov_grids_sliced, df_pov_grids_sliced; gc.collect();

OSM data

Finally, in order to merge OSM (points of interest) variables to the final dataset, we created a function that loads OSM data for voivodeships. Then we count the points of interest for each grid. Thus, we obtain the final dataset for grids.

In [18]:
def merging_vars(data, poviat, path):
    
    # df for concrete poviat
    df_output = data[data.jpt_nazwa_ == poviat]
    
    # loading points of interest
    buildings = gpd.read_file(path + 'gis_osm_buildings_a_free_1.shp')
    landuse = gpd.read_file(path + 'gis_osm_landuse_a_free_1.shp')
    pois_a = gpd.read_file(path + 'gis_osm_pois_a_free_1.shp')
    railways = gpd.read_file(path + 'gis_osm_railways_free_1.shp')
    roads = gpd.read_file(path + 'gis_osm_roads_free_1.shp')
    traffic_a = gpd.read_file(path + 'gis_osm_traffic_a_free_1.shp')
    traffic = gpd.read_file(path + 'gis_osm_traffic_free_1.shp')
    transport_a = gpd.read_file(path + 'gis_osm_transport_a_free_1.shp')
    
    # concrete points from above dfs
    buildings_points = buildings[buildings['type'].isin(['house', 'residential', 'bungalow', 'apartment'])][['osm_id', 'geometry']]
    shop_points = buildings[buildings['type'].isin(['supermarket', 'bakery', 'kiosk', 'mall', 'department_store', 'convenience', 'clothes', 'florist', 'chemist'])][['osm_id', 'geometry']]
    parks_points = landuse[landuse['fclass'] == 'park'][['osm_id', 'geometry']]
    forest_points = landuse[landuse['fclass'] == 'forest'][['osm_id', 'geometry']]
    schools_points = pois_a[pois_a['fclass'].isin(['school', 'playground'])][['osm_id', 'geometry']]
    railways_points = railways[['osm_id', 'geometry']]
    cycleways_points = roads[roads['fclass'] == 'cycleway'][['osm_id', 'geometry']]
    parking_points = traffic_a[traffic_a['fclass'] == 'parking'][['osm_id', 'geometry']]
    crossing_points = traffic[traffic['fclass'] == 'crossing'][['osm_id', 'geometry']]
    bus_stop_points = transport_a[transport_a['fclass'] == 'bus_stop'][['osm_id', 'geometry']]
    
    # unnecessery dfs
    del buildings, landuse, pois_a, railways, roads, traffic_a, traffic, transport_a; gc.collect()
    
    # changing crs
    buildings_points = buildings_points.to_crs("epsg:4258")
    shop_points = shop_points.to_crs("epsg:4258")
    parks_points = parks_points.to_crs("epsg:4258")
    forest_points = forest_points.to_crs("epsg:4258")
    schools_points = schools_points.to_crs("epsg:4258")
    railways_points = railways_points.to_crs("epsg:4258")
    cycleways_points = cycleways_points.to_crs("epsg:4258")
    parking_points = parking_points.to_crs("epsg:4258")
    crossing_points = crossing_points.to_crs("epsg:4258")
    bus_stop_points = bus_stop_points.to_crs("epsg:4258")
    
    # list of dataframes
    list_of_dfs = [buildings_points, shop_points, parks_points, forest_points, schools_points, railways_points,
                  cycleways_points, parking_points, crossing_points, bus_stop_points]
    
    # names of new columns
    names = ['buildings', 'shops', 'parks', 'forests', 'schools', 'railways',
                  'cycleways', 'parkings', 'crossings', 'bus_stops']
    
    # groupby points in a loop
    for i in range(len(list_of_dfs)):
        actual_point = gpd.sjoin(list_of_dfs[i], df_output, how="inner", op="intersects")
        x = actual_point[['osm_id', 'grid_index']].groupby(['grid_index']).count()
        x.rename(columns={"osm_id": names[i]}, inplace=True)
        x.reset_index(inplace = True)
        df_output = df_output.merge(x, on = 'grid_index', how='outer')
        
    df_output.fillna(0, inplace=True)
    df_output.drop(columns = {'jpt_nazwa_'}, inplace = True)

    return df_output

Saving final dataset for grids

In [19]:
df_krakow = merging_vars(df_pov_grids, 'powiat Kraków', '../datasets/raw_data/osm_malopolskie/')
df_krakow[['tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male', 'tot_fem',
       'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14', 'fem_15_64',
       'fem_65__']] = df_krakow[['tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male', 'tot_fem',
       'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14', 'fem_15_64',
       'fem_65__']].astype(float)

df_krakow.to_csv('../datasets/preprocessed_data/df_krakow.csv', index = False)
df_krakow.to_file('../datasets/preprocessed_data/df_krakow.shp')
del df_krakow; gc.collect()
Out[19]:
0
In [21]:
df_warszawa = merging_vars(df_pov_grids, 'powiat Warszawa', '../datasets/raw_data/osm_mazowieckie/')
df_warszawa[['tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male', 'tot_fem',
       'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14', 'fem_15_64',
       'fem_65__']] = df_warszawa[['tot', 'tot_0_14', 'tot_15_64', 'tot_65__', 'tot_male', 'tot_fem',
       'male_0_14', 'male_15_64', 'male_65__', 'fem_0_14', 'fem_15_64',
       'fem_65__']].astype(float)

df_warszawa.to_csv('../datasets/preprocessed_data/df_warszawa.csv', index = False)
df_warszawa.to_file('../datasets/preprocessed_data/df_warszawa.shp')
del df_warszawa; gc.collect()
Out[21]:
362

Dataset for GWR

When constructing the dataset for GWR, we followed the same steps as for the above dataset. The only difference is that we collected data for 500m buffers.

In [22]:
df_geo = df.copy()
In [23]:
df_geo['point_id'] = list(range(0, df_geo.shape[0]))
In [24]:
pov_df_geo = gpd.sjoin(pov, df_geo, how="inner", op="intersects")
In [25]:
pov_df_geo = pov_df_geo.merge(df_geo, on = 'point_id')
In [26]:
pov_df_geo = pov_df_geo.drop(columns = {'geometry_x', 'index_right', 'operator_y'})
In [27]:
pov_df_geo = pov_df_geo.rename(columns={"geometry_y": "geometry", "JPT_NAZWA_": "jpt_nazwa_", "operator_x": "operator"})
In [28]:
inpost_points = pov_df_geo[pov_df_geo['operator'] == 'inpost']

Creating 500m buffer around each Inpost point

In [29]:
# creating buffer (500m radius)
def buffer(point):

    n_points = 50
    angles = np.linspace(0, 360, n_points)
    radius = 500
    polygon = geog.propagate(point, angles, radius)

    x = polygon.tolist()
    lon = list(list(zip(*x))[0])
    lat = list(list(zip(*x))[1])
    pts = gpd.GeoSeries([Point(x, y) for x, y in zip(lon, lat)])
    poly = geometry.Polygon([[p.x, p.y] for p in pts])
    polyg = gpd.GeoSeries(poly)
    

    return polyg
In [30]:
inpost_points['buffer'] = inpost_points.apply(lambda x: buffer(x['geometry']), axis=1)
inpost_points = inpost_points.rename(columns={"geometry": "center"})
In [31]:
inpost_buffers = gpd.GeoDataFrame(inpost_points, geometry = 'buffer', crs = "epsg:4258")
inpost_buffers = inpost_buffers.reset_index()
inpost_buffers = inpost_buffers.rename(columns={"index":"buffer_index"})

Parcel points

In [32]:
def merge_points(data_inpost, data_operator, operator):
    new_df = data_operator[data_operator['operator'] == operator]
    new_df = gpd.GeoDataFrame(new_df, geometry = 'geometry', crs = "epsg:4258")
    
    actual_point = gpd.sjoin(new_df, data_inpost, how="inner", op="intersects")
    x = actual_point[['operator_left', 'buffer_index']].groupby(['buffer_index']).count()
    x.rename(columns={"operator_left": operator+'_points'}, inplace=True)
    x.reset_index(inplace = True)
    df_output = data_inpost.merge(x, on = 'buffer_index', how='outer')
    df_output.fillna(0, inplace=True)
    
    return df_output
In [33]:
operators = ['inpost', 'poczta', 'dhl', 'ruch', 'dpd', 'ups', 'fedex']
for operator in operators:
    inpost_buffers = merge_points(inpost_buffers, pov_df_geo, operator)

Inspire data

In [34]:
buff = gpd.sjoin(inpost_buffers, grids, how="inner", op="intersects")

x = buff[['buffer_index', 'TOT', 'TOT_0_14', 'TOT_15_64', 'TOT_65__', 'TOT_MALE', 'TOT_FEM',
       'MALE_0_14', 'MALE_15_64', 'MALE_65__', 'FEM_0_14', 'FEM_15_64',
       'FEM_65__', 'FEM_RATIO']].groupby(['buffer_index']).mean()

inpost_buffers = inpost_buffers.merge(x, on='buffer_index')

OSM data

In [35]:
def merge_osm(data, poviat, path):
    
    # df for concrete poviat
    df_output = data[data.jpt_nazwa_ == poviat]
    
    # loading points of interest
    buildings = gpd.read_file(path + 'gis_osm_buildings_a_free_1.shp')
    landuse = gpd.read_file(path + 'gis_osm_landuse_a_free_1.shp')
    pois_a = gpd.read_file(path + 'gis_osm_pois_a_free_1.shp')
    railways = gpd.read_file(path + 'gis_osm_railways_free_1.shp')
    roads = gpd.read_file(path + 'gis_osm_roads_free_1.shp')
    traffic_a = gpd.read_file(path + 'gis_osm_traffic_a_free_1.shp')
    traffic = gpd.read_file(path + 'gis_osm_traffic_free_1.shp')
    transport_a = gpd.read_file(path + 'gis_osm_transport_a_free_1.shp')
    
    # concrete points from above dfs
    buildings_points = buildings[buildings['type'].isin(['house', 'residential', 'bungalow', 'apartment'])][['osm_id', 'geometry']]
    shop_points = buildings[buildings['type'].isin(['supermarket', 'bakery', 'kiosk', 'mall', 'department_store', 'convenience', 'clothes', 'florist', 'chemist'])][['osm_id', 'geometry']]
    parks_points = landuse[landuse['fclass'] == 'park'][['osm_id', 'geometry']]
    forest_points = landuse[landuse['fclass'] == 'forest'][['osm_id', 'geometry']]
    schools_points = pois_a[pois_a['fclass'].isin(['school', 'playground'])][['osm_id', 'geometry']]
    railways_points = railways[['osm_id', 'geometry']]
    cycleways_points = roads[roads['fclass'] == 'cycleway'][['osm_id', 'geometry']]
    parking_points = traffic_a[traffic_a['fclass'] == 'parking'][['osm_id', 'geometry']]
    crossing_points = traffic[traffic['fclass'] == 'crossing'][['osm_id', 'geometry']]
    bus_stop_points = transport_a[transport_a['fclass'] == 'bus_stop'][['osm_id', 'geometry']]
    
    # unnecessery dfs
    del buildings, landuse, pois_a, railways, roads, traffic_a, traffic, transport_a; gc.collect()
    
    # changing crs
    buildings_points = buildings_points.to_crs("epsg:4258")
    shop_points = shop_points.to_crs("epsg:4258")
    parks_points = parks_points.to_crs("epsg:4258")
    forest_points = forest_points.to_crs("epsg:4258")
    schools_points = schools_points.to_crs("epsg:4258")
    railways_points = railways_points.to_crs("epsg:4258")
    cycleways_points = cycleways_points.to_crs("epsg:4258")
    parking_points = parking_points.to_crs("epsg:4258")
    crossing_points = crossing_points.to_crs("epsg:4258")
    bus_stop_points = bus_stop_points.to_crs("epsg:4258")
    
    # list of dataframes
    list_of_dfs = [buildings_points, shop_points, parks_points, forest_points, schools_points, railways_points,
                  cycleways_points, parking_points, crossing_points, bus_stop_points]
    
    # names of new columns
    names = ['buildings', 'shops', 'parks', 'forests', 'schools', 'railways',
                  'cycleways', 'parkings', 'crossings', 'bus_stops']
    
    # groupby points in a loop
    for i in range(len(list_of_dfs)):
        actual_point = gpd.sjoin(list_of_dfs[i], df_output, how="inner", op="intersects")
        x = actual_point[['osm_id', 'buffer_index']].groupby(['buffer_index']).count()
        x.rename(columns={"osm_id": names[i]}, inplace=True)
        x.reset_index(inplace = True)
        df_output = df_output.merge(x, on = 'buffer_index', how='outer')
        
    df_output.fillna(0, inplace=True)
    df_output.drop(columns = {'jpt_nazwa_'}, inplace = True)

    return df_output

Saving final dataset for buffers

In [36]:
df_krakow_inpost = merge_osm(inpost_buffers, 'powiat Kraków', '../datasets/raw_data/osm_malopolskie/')
df_krakow_inpost.to_csv('../datasets/preprocessed_data/df_krakow_gwr.csv', index = False)
In [38]:
del df_krakow_inpost; gc.collect();
In [42]:
df_warszawa_inpost = merge_osm(inpost_buffers, 'powiat Warszawa', '../datasets/raw_data/osm_mazowieckie/')
df_warszawa_inpost.to_csv('../datasets/preprocessed_data/df_warszawa_gwr.csv', index = False)
In [43]:
del df_warszawa_inpost; gc.collect();

4. Spatial visualizations

Import dependencies

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
% matplotlib inline

# geo libraries
import shapely
from shapely.geometry import Point, Polygon
import shapely.wkt
import geopandas as gpd
import contextily as ctx

# colors
from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
from mpl_toolkits.axes_grid1 import make_axes_locatable
import matplotlib.colors as mcolors

In this part, we present visualizations for the main part of our research, i.e. grids data. In the first step, we perform a data transformation. In the second step, we present maps showing interesting relationships for each of the analyzed cities. We focus primarily on confronting Inpost points with their competition, as well as on showing interesting variables, both demographic and determining points of interest.

Taking into account the transformations, we obtain the following variables for visualization:

  • leader - for each grid we check which operator has the most points. In the case of an equal number of points, we choose leader according to the highest total number of points. If no operator has a point in a given grid, then there is no leader.
  • dominance_inpost2 - if Inpost has more own points than 50% of the sum of all other operators - the value is True, otherwise the value is False.

Additionally, we create own colour pallets with make_colormap function.

We present visualizations according to variables, i.e. each visualized feature for 2 cities.

Loading datasets

In [4]:
df_warszawa =pd.read_csv("../datasets/preprocessed_data/df_warszawa.csv")
df_krakow = pd.read_csv("../datasets/preprocessed_data/df_krakow.csv")
df_warszawa['city'] = 'Warszawa'
df_krakow['city'] = 'Krakow'
In [5]:
df_all = pd.concat([df_warszawa, df_krakow])

Transformations of datasets for visualisations

In [6]:
def transform_df(df):
    
    # geo changes
    df['geometry'] = df.apply(lambda x: shapely.wkt.loads(x['geometry']), axis=1)
    df = gpd.GeoDataFrame(df, geometry = 'geometry', crs = "epsg:4258")
    
    # new variables
    df['operators'] = df.apply(lambda x: [x['inpost'], x['poczta'], x['dhl'], x['dpd'], x['ruch'], x['ups'], x['fedex']], axis = 1)
    df['leader'] = df.apply(lambda x: np.max(x['operators']) != 0, axis = 1)
    df['pos'] = df.apply(lambda x: np.where((x['leader'] == 1), np.argmax(x['operators']), 100), axis = 1)
    
    # dictionaries for applying
    dict_pos = {0: 'inpost', 1: 'poczta', 2: 'dhl', 3: 'other', 4: 'other', 5: 'other', 6: 'other', 100: 'no_leader'}
    dict_pos2 = {0: 'inpost', 1: 'poczta', 2: 'dhl', 3: 'dpd', 4: 'ruch', 5: 'ups', 6: 'fedex', 100: 'no_leader'}
    
    # new variables
    df['leader'] = df['pos'].replace(dict_pos)
    df['no_leader'] = 0
    df['leader_number'] = df.apply(lambda x: x[dict_pos2[float(x['pos'])]], axis = 1)
    df['other'] = df['dhl'] + df['dpd'] + df['fedex'] + df['poczta'] + df['ruch'] + df['ups']
    df['dominance_inpost'] = df['inpost'] > df['other']
    df['dominance_inpost2'] = df['inpost'] > 1/2*df['other']
    
    # changing crs (essential change for library contextily to show visualisation on map)
    df = df.to_crs(epsg=3857)
    
    return df
In [7]:
df_warszawa = transform_df(df_warszawa)
df_krakow = transform_df(df_krakow)
df_all = transform_df(df_all)

Own color pallettes

In [8]:
# function from stackoverflow for making own cmap
def make_colormap(seq):
    """Return a LinearSegmentedColormap
    seq: a sequence of floats and RGB-tuples. The floats should be increasing
    and in the interval (0,1).
    """
    seq = [(None,) * 3, 0.0] + list(seq) + [1.0, (None,) * 3]
    cdict = {'red': [], 'green': [], 'blue': []}
    for i, item in enumerate(seq):
        if isinstance(item, float):
            r1, g1, b1 = seq[i - 1]
            r2, g2, b2 = seq[i + 1]
            cdict['red'].append([item, r1, r2])
            cdict['green'].append([item, g1, g2])
            cdict['blue'].append([item, b1, b2])
    return mcolors.LinearSegmentedColormap('CustomMap', cdict)

c = mcolors.ColorConverter().to_rgb
In [9]:
# our own palletes with colours
operators_cmap = make_colormap([c('red'), 0.2, c('magenta'), 0.4, c('grey'), 0.6, c('green'), 0.8, c('blue')])
blues = make_colormap([c('grey'), c('skyblue'), 0.1, c('skyblue'), c('aqua'), 0.2, c('aqua'), c('blue')])
greens = make_colormap([c('grey'), c('lightgreen'), 0.1, c('lightgreen'), c('lime'), 0.4, c('lime'), c('green')])
greys_cmap = make_colormap([c('grey'), c('grey'), 0.33, c('grey'), c('grey'), 0.66, c('grey')])
reds_cmap = make_colormap([c('lightcoral'), c('salmon'), 0.2, c('salmon'), c('indianred'), 0.4, c('indianred'), c('firebrick')])
blues_cmap = make_colormap([c('lightblue'), c('aqua'), 0.2, c('aqua'), c('deepskyblue'), 0.4, c('deepskyblue'), c('blue')])
greens_cmap = make_colormap([c('lightgreen'), c('lime'), 0.2, c('lime'), c('limegreen'), 0.4, c('limegreen'), c('green')])
pinks_cmap = make_colormap([c('violet'), c('magenta'), 0.2, c('magenta'), c('darkviolet'), 0.4, c('darkviolet'), c('purple')])
binary_cmap = make_colormap([c('red'), c('red'), 0.5, c('green'), c('green')])

Function for visualisation

In [10]:
def visualisation_basic(df):
    
    fig, ax = plt.subplots(1, 2, figsize = (30, 10))
    print('\nBasic plot of city')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), alpha = 0.4, edgecolor='black', linewidth=3, legend = True, ax = ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Basic plot of Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), alpha = 0.4, edgecolor='black', linewidth=3, legend = True, ax = ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Basic plot of Kraków')
    plt.plot()
    plt.show()
    
def visualisation_leader(df):    
    fig, ax = plt.subplots(1, 2, figsize = (30, 10))
    print('\nLeader of operators')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="leader", categorical = True, legend=True, cmap=operators_cmap, edgecolor='black', alpha = 0.7, ax = ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Leader of operators in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="leader", categorical = True, legend=True, cmap=operators_cmap, edgecolor='black', alpha = 0.7, ax = ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Leader of operators in Kraków')
    plt.plot()
    plt.show()

def visualisation_dominance(df):    
    fig, ax = plt.subplots(1, 2, figsize = (30, 10))
    print('\nDominance of inpost points over all other operators - > 50%')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="dominance_inpost2", categorical = True, legend=True, cmap=binary_cmap, edgecolor='white', alpha = 0.7, ax = ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Dominance of Inpost points in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="dominance_inpost2", categorical = True, legend=True, cmap=binary_cmap, edgecolor='white', alpha = 0.7, ax = ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Dominance of Inpost points in Kraków')
    plt.plot()
    plt.show()

def visualisation_total(df):
    fig, ax = plt.subplots(1, 2, figsize = (30, 10))
    print('\nTotal')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="tot", legend=True, cmap='Reds', edgecolor='white', alpha = 0.7, ax = ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Total population in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="tot", legend=True, cmap='Reds', edgecolor='white', alpha = 0.7, ax = ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Total population in Kraków')
    plt.plot()
    plt.show()

def visualisation_forests(df):    
    fig, ax = plt.subplots(1, 2, figsize = (30, 10))
    print('\nForests')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="forests", legend=True, cmap=greens, edgecolor='white', alpha = 0.7, ax = ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Number of forests in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="forests", legend=True, cmap=greens, edgecolor='white', alpha = 0.7, ax = ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Number of forests in Kraków') 
    plt.plot()
    plt.show()

def visualisation_schools(df):        
    fig, ax = plt.subplots(1, 2, figsize = (30, 10))
    print('\nSchools')
    ctx.add_basemap(df[df.city == 'Warszawa'].plot(figsize=(20,10), column="schools", legend=True, cmap=blues, edgecolor='white', alpha = 0.7, ax = ax[0]))
    ax[0].set_axis_off()
    ax[0].set_title('Number of schools in Warszawa')
    ctx.add_basemap(df[df.city == 'Krakow'].plot(figsize=(20,10), column="schools", legend=True, cmap=blues, edgecolor='white', alpha = 0.7, ax = ax[1]))
    ax[1].set_axis_off()
    ax[1].set_title('Number of schools in Kraków')
    plt.plot()
    plt.show()
    

The figure below shows all cities with the background of OSM maps.

In [11]:
visualisation_basic(df_all)
Basic plot of city

The next figure shows the leader of the operators in each grid. We can see that Poczta Polska dominates the city center (blue color on the map). In turn, Inpost dominates around the center (pink color), i.e. around Poczta Polska points. DHL (red color) and other smaller operators (green color) are the leaders in a few points. On the outskirts of cities, as a rule, there is no leader (gray color).

In [12]:
visualisation_leader(df_all)
Leader of operators

The figure below shows where Inpost dominates over other operators. Green color means that Inpost points account for more than 50% of all other operators. Referring to the previous figure, Inpost's dominance is mainly in those points where Inpost is the leader on its own.

In [13]:
visualisation_dominance(df_all)
Dominance of inpost points over all other operators - > 50%

Then we show how the demographic variable is distributed in cities. As expected, total population is the highest in city centers.

In [14]:
visualisation_total(df_all)
Total

Taking into account forests points, their distribution on the map is very regual. It is noticeable that there is also a lot of green areas on the outskirts of cities.

In [15]:
visualisation_forests(df_all)
Forests

Finally, we show the location of the schools (elementary schools, kindergartens, etc.). As expected, the highest number of schools is in the center, where also the total population is highest.

In [16]:
visualisation_schools(df_all)
Schools

5. Spatial statistical explanatory analysis

This chapter is devoted to Spatial statistical explanatory analysis. We utilize here many spatial statistic tools to understand better spatial dependencies in our datasets! We distinguished such Spatial SEDA stages like:

  1. Spatial weights analysis (based on grid data)
  2. Spatial Lag analysis (based on grid data)
  3. Global Spatial Autocorrelation analysis (based on grid data)
  4. Local Spatial Autocorrelation analysis (based on grid data)
  5. Tessellation and entropy analysis (based on point data)

Importantly, in the study, we considered which weight matrix would be appropriate. We chose between contiguity by ROOK and contiguity by QUEEN. Finally, based on attempts at multivariate spatial econometric modeling, we concluded that QUEEN would be the appropriate matrix (due to spatial terms significance). This solution is basically intuitive in terms of the problem we are addressing.

Import dependencies

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely import geometry
from shapely import wkt
from libpysal.weights import contiguity
from libpysal.weights import lag_spatial
from pyproj import CRS
import mapclassify as mc
import splot
import esda
import seaborn as sns
from esda.moran import Moran
from splot.esda import moran_scatterplot
from splot.esda import plot_moran
import matplotlib.pyplot as plt
import gc
from esda.moran import Moran_Local
from splot.esda import lisa_cluster
from splot.esda import plot_local_autocorrelation
from splot.libpysal import plot_spatial_weights
from esda.moran import Moran_BV_matrix
from splot.esda import moran_facet
from shapely.ops import cascaded_union
from geovoronoi.plotting import subplot_for_map, plot_voronoi_polys_with_points_in_area
from geovoronoi import voronoi_regions_from_coords, points_to_coords

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
%config Completer.use_jedi = False

Load data

In the first step let us import datasets which are necessary to run analysis: final dataset for Warsaw, Cracow and GUGIK powiats!

In [2]:
def load_shp_from_csv(name:str):
    df = pd.read_csv(f'../datasets/preprocessed_data/{name}.csv')
    df['geometry'] = df['geometry'].apply(wkt.loads)
    df = gpd.GeoDataFrame(df, crs='epsg:4258')
    return df
In [4]:
df_warszawa = load_shp_from_csv("df_warszawa")
df_krakow = load_shp_from_csv("df_krakow")

pov = gpd.read_file("../datasets/raw_data/guigk_pov/Powiaty.shx", encoding='utf-8')
pov = pov.loc[pov.JPT_NAZWA_.isin(["powiat Warszawa", "powiat Kraków"])==True, ("JPT_NAZWA_", "geometry")]
pov = pov.to_crs("epsg:4258")

print(df_warszawa.shape, df_krakow.shape)
ERROR:fiona._env:PROJ: proj_identify: C:\Users\wozni\AppData\Local\Continuum\anaconda3\Library\share\proj\proj.db lacks DATABASE.LAYOUT.VERSION.MAJOR / DATABASE.LAYOUT.VERSION.MINOR metadata. It comes from another PROJ installation.
ERROR:fiona._env:PROJ: proj_create_from_name: C:\Users\wozni\AppData\Local\Continuum\anaconda3\Library\share\proj\proj.db lacks DATABASE.LAYOUT.VERSION.MAJOR / DATABASE.LAYOUT.VERSION.MINOR metadata. It comes from another PROJ installation.
(601, 32) (396, 32)

Run Spatial SEDA

Here we define one utility function which allow us to run spatial SEDA per each city and stage. Of course in case of clean code it should we written in separate functions, but due to lack of time we left one function.

In [6]:
def spatial_seda(city, city_name, analysis_part):    
    df = city.copy()
    df_facet = df[['poczta', 'dhl', 'inpost', 'dpd', 'ruch']]
    
    # spatial weight matrix
    y = df.inpost.values
    w = contiguity.Queen.from_dataframe(df)
    w.transform = 'r'
    if analysis_part == 1:
        plot_spatial_weights(w, df["geometry"].to_crs("epsg:3395"))
        plt.title("Spatial weights visualization")
        plt.show()

        #general plot, quantiles plot, spatial lag plot
        fig, axs = plt.subplots(1, 3, figsize=(30,10))

        df.plot(column="inpost", legend=True, figsize=(10,10), ax=axs[0])
        axs[0].set_title("InPost")
        axs[0].legend(prop={'size': 2})

        df.plot(column='inpost', scheme='Quantiles', k=5, legend=True, figsize=(10,10), ax=axs[1])
        axs[1].set_title("InPost (Quintiles)")

        ylag = lag_spatial(w, y)
        ylagq5 = mc.Quantiles(ylag, k=4)

        df.assign(cl=ylagq5.yb).plot(column='cl', categorical=True, k=4, linewidth=0.1, ax=axs[2], legend=True, cmap="viridis")
        axs[2].set_title("Spatial Lag InPost (Quintiles)")
        plt.show()
    elif analysis_part == 2:
        #moran global
        mi = esda.moran.Moran(y, w)
        print("Moran's I", mi.I, "\np-value of I under normality assumption", mi.p_norm, "\np-value based on permutations", mi.p_sim)
        plot_moran(mi, zstandard=True, figsize=(10,4))
        plt.show()

        #geary global
        geary = esda.geary.Geary(y, w)
        print("Geary's: ", geary.C, "p-value:", geary.p_sim)

        # joint count global
        y_prim = (y > np.quantile(y, 0.85)) * 1
        df["inpost_bin"] = y_prim
        df.plot(column="inpost_bin", legend=True, figsize=(10,10))
        plt.title("InPost in Joint Count analysis")
        plt.show()
        w.transform = 'O'
        jc = esda.join_counts.Join_Counts(df.inpost_bin, w)
        print("Joint count bb pvalue", jc.p_sim_bb, ";JC bw pvalue", jc.p_sim_bw)
        w.transform = 'R'
    elif analysis_part == 3:
        #moran local
        moran_loc = Moran_Local(y, w)
        fig, ax = moran_scatterplot(moran_loc, p=0.1)
        ax.set_xlabel('InPost')
        ax.set_ylabel('Spatial Lag of InPost')
        plt.show()
        lisa_cluster(moran_loc, df.to_crs("epsg:3395"), p=0.1, figsize = (10, 10), legend=True, )
        plt.show()

        #moran bv analysis
        matrix2 = Moran_BV_matrix(df_facet, w)
        moran_facet(matrix2)
        plt.show()
        
    elif analysis_part == 4:
        #tessellation
        df_point = load_shp_from_csv("pickup_points_by_operator")
        df_point = df_point[df_point.operator=="inpost"]
        curr_pov = pov[pov.JPT_NAZWA_ == city_name]
        df_point = gpd.sjoin(df_point, curr_pov, op='within', how='inner').to_crs(epsg=3395)
        curr_pov = curr_pov.to_crs(epsg=3395)
        boundary_shape = cascaded_union(curr_pov.geometry)
        coords = points_to_coords(df_point.geometry)
        region_polys, region_pts = voronoi_regions_from_coords(coords, boundary_shape)

        fig, ax = subplot_for_map(figsize=(20,10))
        plot_voronoi_polys_with_points_in_area(ax, boundary_shape, region_polys, coords, region_pts)
        plt.show()

        fig, ax = subplot_for_map(figsize=(20,10))
        plot_voronoi_polys_with_points_in_area(ax, boundary_shape, region_polys, coords, points_markersize=0.1, voronoi_color='grey', voronoi_edgecolor='black')
        plt.show()

        df_point = load_shp_from_csv("pickup_points_by_operator")
        stats_per_operator = list()

        for operator in df_point.operator.unique():
            df_point_copy = df_point.copy()
            df_point_copy = df_point_copy[df_point_copy.operator==operator]
            curr_pov = pov[pov.JPT_NAZWA_ == city_name]
            df_point_copy = gpd.sjoin(df_point_copy, curr_pov, op='within', how='inner').to_crs(epsg=3395)
            curr_pov = curr_pov.to_crs(epsg=3395)
            boundary_shape = cascaded_union(curr_pov.geometry)
            coords = points_to_coords(df_point_copy.geometry)
            region_polys, region_pts = voronoi_regions_from_coords(coords, boundary_shape)
            a = np.array([region_polys[i].area for i in range(len(region_polys))])
            a1 = a/sum(a)
            ent1 = sum(-1*a1*np.log(a1))
            n = len(a)
            ent_ref = np.log(1/n)*(-1)
            ent_rel = ent1/ent_ref
            stats_per_operator.append([ent1, ent_rel])

        x = pd.DataFrame(stats_per_operator, index=df_point.operator.unique(), columns=["Shannon entropy", "Relative H entropy"])
        display(x)
        x["Relative H entropy"].plot(figsize=(10,5), title="Relative H entropy per operator")
        plt.show()

Warsaw

Spatial weights analysis, Spatial Lag analysis

We start the analysis by examining the neighbors formed by the spatial weight matrix. We can see that for QUEEN contiguity was well applied.

Then we move to visualizations of InPost pick-up points. The first visualization shows the spatial distribution of InPost pickup points assuming it is a continuous variable. The next plots divides this variable into quintiles, while the last visualization is a spatial lag visualization divided into quintiles. The division into quintiles allows us to better observe the density of InPost pickup points, while the spatial lag, in addition to the natural econometric interpretation (variable that averages the neighboring values of a location), gives us the opportunity to smooth the quantile graph and enhance the impression of value similarity in space. Generally, we can clearly see that the most pickup points are located in the city center, in the Praga's center and on the wester part. On the south and east there are still places with 0 pickup points.

In [8]:
spatial_seda(df_warszawa, "powiat Warszawa", 1)
WARNING:matplotlib.legend:No handles with labels found to put in legend.
C:\Users\wozni\AppData\Local\Continuum\anaconda3\lib\site-packages\mapclassify\classifiers.py:235: UserWarning: Warning: Not enough unique values in array to form k classes
  "Warning: Not enough unique values in array to form k classes", UserWarning
C:\Users\wozni\AppData\Local\Continuum\anaconda3\lib\site-packages\mapclassify\classifiers.py:237: UserWarning: Warning: setting k to 4
  Warn("Warning: setting k to %d" % k_q, UserWarning)

Global Spatial Autocorrelation analysis

Then we move to the global spatial autocorrelation analysis.

First we checked Moran's I statistic. Based on its p-value and statistic we can claim that there is positive spatial autocorrelation in the InPost data. We utilized also Moran Scatterplot to visualize the relation between areas.

Then we checked whether Geary's statistic also indicates positive spatial autocorrelation, and based on its output we also claim that there is spatial autocorrelation in the data.

Last but not least we applied Joint Count test for binary case (85 percentile as as cut-off point for dependent variable) and it also indicated that the number of InPost's pick-up points in grids are not randomly distributed in case of spatial term.

In [14]:
spatial_seda(df_warszawa, "powiat Warszawa", 2)
Moran's I 0.4020725532811403 
p-value of I under normality assumption 0.0 
p-value based on permutations 0.001
Geary's:  0.5984578850664994 p-value: 0.001
Joint count bb pvalue 0.001 ;JC bw pvalue 1.0

Local Spatial Autocorrelation analysis

In the next step we utilized Moran Local measure (LISA) to obtain Moran Local Scatterplot. In Python pysal package it allow us to visualize Hot Spots (HH), Cold Spots (LL), and Spatial Outliers (HL, LH). Again we see that city center is the Hot Spot. It is evident that bedroom suburbs are classified as Cold Spots. HL spots (so called diamons) occur in or near bedroom suburbs, while LH spots occur near the city center. This conclusions are quite intuitive.

Then we ploted bivariate moran scatterplot for our logistic companies. This illustrates that relationships between these variables do indeed exist.

In [15]:
spatial_seda(df_warszawa, "powiat Warszawa", 3)

Tessellation and entropy analysis

Importantly, here we switched to point data from the grid data!!!

The final piece of SEDA is an attempt to use the tessellation mechanism to visualize and examine the agglomeration per each logistic company. Visualizations presented below perfectly show InPost pick-up points concentration and leads to the same conclusions as previously!

Then we utilized Shannon entropy and Relative H entropy to analyse agglomeration per each logistic company. We can see that DHL and Poczta Polska are top agglomerated companies. InPost is in the middle of the ranking. It seems to be the right place because too high agglomeration has a negative impact on accessibility in all parts of Warsaw.

In [16]:
spatial_seda(df_warszawa, "powiat Warszawa", 4)
Shannon entropy Relative H entropy
dpd 5.092312 0.900897
fedex 2.666880 0.922677
inpost 6.198246 0.905680
poczta 6.059140 0.864541
ruch 4.876571 0.870487
ups 4.381173 0.876723
dhl 5.905179 0.856863

Cracow

Spatial weights analysis, Spatial Lag analysis

In the Cracow case weight matrix based on QUEEN contiguity works also better that ROOK (in econometric part). Based on the first plot we see that it was correctly calculated.

Then we analyze visualization of InPost in continuous, quintiles and spatial lag quintiles manor. We can clearly see that pick-up points are located strictly in the city center. The lowest level of coverage is in the eastern and western parts of the city (with an emphasis on the eastern).

In [10]:
spatial_seda(df_krakow, "powiat Kraków", 1)
WARNING:matplotlib.legend:No handles with labels found to put in legend.
C:\Users\wozni\AppData\Local\Continuum\anaconda3\lib\site-packages\mapclassify\classifiers.py:235: UserWarning: Warning: Not enough unique values in array to form k classes
  "Warning: Not enough unique values in array to form k classes", UserWarning
C:\Users\wozni\AppData\Local\Continuum\anaconda3\lib\site-packages\mapclassify\classifiers.py:237: UserWarning: Warning: setting k to 3
  Warn("Warning: setting k to %d" % k_q, UserWarning)

Global Spatial Autocorrelation analysis

Global Spatial Autocorrelation measures like Moran's I and Geary's indicates positive spatial autocorrelation (statistically significant) in the InPost data. Also Joint Count shows spatial autocorrelation for InPost pickup-points.

In [11]:
spatial_seda(df_krakow, "powiat Kraków", 2)
Moran's I 0.521756272206774 
p-value of I under normality assumption 0.0 
p-value based on permutations 0.001
Geary's:  0.47879474971718783 p-value: 0.001
Joint count bb pvalue 0.001 ;JC bw pvalue 1.0

Local Spatial Autocorrelation analysis

LISA allowed us to visualize HH, LL, HL, LH spots in the Cracow. Based on it we can claim that all HH are located in the city center. There are only 2 HL (diamonds). Wester and Easter part of the Cracow are definitely LL.

Then we ploted bivariate moran scatterplot for our logistic companies. This illustrates that relationships between these variables do indeed exist.

In [12]:
spatial_seda(df_krakow, "powiat Kraków", 3)

Tessellation and entropy analysis

Importantly, here we switched to point data from the grid data!!!

Points visualization using tessellation technique presented below perfectly show InPost pick-up points concentration in the city center, thus it leads to the same conclusions as previously!

Then we utilized Shannon entropy and Relative H entropy to analyse agglomeration per each logistic company. We can see that Ruch, Poczta Polska and DHL are top agglomerated companies. InPost again is in the middle of the ranking. It seems to be the right place because too high agglomeration has a negative impact on accessibility in all parts of Warsaw.

In [13]:
spatial_seda(df_krakow, "powiat Kraków", 4)
Shannon entropy Relative H entropy
dpd 3.531301 0.784767
fedex 2.181277 0.850417
inpost 4.854113 0.814311
poczta 4.703335 0.763641
ruch 3.160325 0.741394
ups 3.216875 0.818163
dhl 4.601693 0.767403

6. Nonspatial statistical explanatory analysis

Import dependencies

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
% matplotlib inline

# libraries for geo transformation to save file to shp format
import shapely.wkt
import geopandas as gpd

In this part of the study, we conduct non spatial exploratory data analysis. First, we compute descriptive stats. Then we check which variables should be removed from the dataset. For the remaining variables, we check their distributions with histograms and box plots. Next we calculate the correlation between the variables according to their category, i.e. operators, demographic and points of interest data. Finally, we transform variables if necessary - logarithmization and binarization of variables.

Loading data

In [2]:
df_warszawa = pd.read_csv("../datasets/preprocessed_data/df_warszawa.csv")
df_krakow = pd.read_csv("../datasets/preprocessed_data/df_krakow.csv")
df_warszawa['city'] = 'Warszawa'
df_krakow['city'] = 'Krakow'
In [3]:
df = pd.concat([df_warszawa, df_krakow])
df = df.reset_index()
df = df.drop(columns = {'index'})
features = df.columns.to_list()
features.remove('grid_index')
features.remove('city')
features.remove('geometry')

Descriptive statistics

Considering the descriptive stats table, we can see that we have a division into count data (operators and points of interest data) and a continuous variable (demographic data). In the next steps, we will examine the distributions of these variables and make some transformations.

In [5]:
df[features].describe()
Out[5]:
dhl dpd fedex inpost poczta ruch ups tot tot_0_14 tot_15_64 ... buildings shops parks forests schools railways cycleways parkings crossings bus_stops
count 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000 ... 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000 997.000000
mean 1.415246 0.384152 0.031093 1.385155 1.611836 0.352056 0.204614 2543.210632 334.043129 1784.209629 ... 17.356068 0.270812 1.113340 10.199599 2.989970 8.476429 8.178536 14.986961 27.742227 0.544634
std 2.790922 0.846603 0.190213 2.112808 3.041015 0.845154 0.558126 4119.705771 502.378489 2884.319749 ... 52.903305 1.022282 3.576782 10.469754 4.744676 19.672180 14.010474 24.877427 40.056972 1.958727
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 138.000000 22.000000 98.000000 ... 0.000000 0.000000 0.000000 3.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 551.000000 93.000000 396.000000 ... 1.000000 0.000000 0.000000 7.000000 1.000000 0.000000 1.000000 3.000000 8.000000 0.000000
75% 2.000000 0.000000 0.000000 2.000000 2.000000 0.000000 0.000000 2746.000000 450.000000 1934.000000 ... 11.000000 0.000000 1.000000 14.000000 4.000000 6.000000 11.000000 20.000000 41.000000 0.000000
max 28.000000 9.000000 2.000000 13.000000 23.000000 7.000000 5.000000 21531.000000 2672.000000 15020.000000 ... 761.000000 15.000000 62.000000 68.000000 30.000000 133.000000 123.000000 203.000000 248.000000 15.000000

8 rows × 30 columns

Features to remove

The cell below shows what percentage of zero values each variable has in our dataset. If the ratio of zeroes for some variable is higher than 90%, such a variable is removed from our dataset.

In [6]:
to_remove = []
for i in features:
    print("\n ############ \nFeature:", i)
    print(round(df[i].value_counts()[0]/df.shape[0]*100,2), f"% of grids with 0 {i}.")
    print(df[i].nunique(), "unique values.")
    
    if(round(df[i].value_counts()[0]/df.shape[0]*100,2)>90):
        to_remove.append(i)
        features.remove(i)
 ############ 
Feature: dhl
62.29 % of grids with 0 dhl.
18 unique values.

 ############ 
Feature: dpd
76.13 % of grids with 0 dpd.
8 unique values.

 ############ 
Feature: fedex
97.19 % of grids with 0 fedex.
3 unique values.

 ############ 
Feature: poczta
59.88 % of grids with 0 poczta.
21 unique values.

 ############ 
Feature: ruch
79.74 % of grids with 0 ruch.
8 unique values.

 ############ 
Feature: ups
84.55 % of grids with 0 ups.
6 unique values.

 ############ 
Feature: tot
6.42 % of grids with 0 tot.
761 unique values.

 ############ 
Feature: tot_0_14
9.43 % of grids with 0 tot_0_14.
476 unique values.

 ############ 
Feature: tot_15_64
6.52 % of grids with 0 tot_15_64.
724 unique values.

 ############ 
Feature: tot_65__
9.43 % of grids with 0 tot_65__.
442 unique values.

 ############ 
Feature: tot_male
6.92 % of grids with 0 tot_male.
655 unique values.

 ############ 
Feature: tot_fem
7.22 % of grids with 0 tot_fem.
657 unique values.

 ############ 
Feature: male_0_14
11.53 % of grids with 0 male_0_14.
374 unique values.

 ############ 
Feature: male_15_64
7.82 % of grids with 0 male_15_64.
599 unique values.

 ############ 
Feature: male_65__
13.54 % of grids with 0 male_65__.
325 unique values.

 ############ 
Feature: fem_0_14
12.14 % of grids with 0 fem_0_14.
362 unique values.

 ############ 
Feature: fem_15_64
7.82 % of grids with 0 fem_15_64.
616 unique values.

 ############ 
Feature: fem_65__
11.84 % of grids with 0 fem_65__.
379 unique values.

 ############ 
Feature: fem_ratio
6.82 % of grids with 0 fem_ratio.
841 unique values.

 ############ 
Feature: buildings
40.72 % of grids with 0 buildings.
109 unique values.

 ############ 
Feature: shops
87.56 % of grids with 0 shops.
12 unique values.

 ############ 
Feature: parks
67.1 % of grids with 0 parks.
18 unique values.

 ############ 
Feature: forests
5.72 % of grids with 0 forests.
57 unique values.

 ############ 
Feature: schools
46.24 % of grids with 0 schools.
27 unique values.

 ############ 
Feature: railways
55.67 % of grids with 0 railways.
84 unique values.

 ############ 
Feature: cycleways
43.53 % of grids with 0 cycleways.
69 unique values.

 ############ 
Feature: parkings
26.78 % of grids with 0 parkings.
101 unique values.

 ############ 
Feature: crossings
24.37 % of grids with 0 crossings.
144 unique values.

 ############ 
Feature: bus_stops
88.77 % of grids with 0 bus_stops.
16 unique values.
In [7]:
# Variables to_remove
to_remove
Out[7]:
['fedex']
In [8]:
# dropping columns
df = df.drop(columns = to_remove)

Histograms and boxplots for each variable

Using histgrams and a plot box, we examine the distributions of all variables. As expected, most features have similar distributions - very many low values and a very long tail with high values.

In [9]:
plt.rcParams["figure.figsize"] = (20, 5)

for i in features:
    print(f"##############\n{i}\n##############")
    
    fig, ax = plt.subplots(1,2)
    sns.histplot(x = df[i], kde=False, edgecolor= "black", color = 'indianred', linewidth= 1.5, ax=ax[0])
    sns.boxplot(x = df[i], color = 'lightgreen', ax=ax[1])
    plt.show()
##############
dhl
##############
##############
dpd
##############
##############
inpost
##############
##############
poczta
##############
##############
ruch
##############
##############
ups
##############
##############
tot
##############
##############
tot_0_14
##############
##############
tot_15_64
##############
##############
tot_65__
##############
##############
tot_male
##############
##############
tot_fem
##############
##############
male_0_14
##############
##############
male_15_64
##############
##############
male_65__
##############
##############
fem_0_14
##############
##############
fem_15_64
##############
##############
fem_65__
##############
##############
fem_ratio
##############
##############
buildings
##############
##############
shops
##############
##############
parks
##############
##############
forests
##############
##############
schools
##############
##############
railways
##############
##############
cycleways
##############
##############
parkings
##############
##############
crossings
##############
##############
bus_stops
##############

Correlation

Next, we examine the correlation of the variables. Due to the relatively large dataset, we will divide the correlation study into three smaller sets according to the characteristics of the variables.

In [10]:
df[features].corr()
Out[10]:
dhl dpd inpost poczta ruch ups tot tot_0_14 tot_15_64 tot_65__ ... buildings shops parks forests schools railways cycleways parkings crossings bus_stops
dhl 1.000000 0.633123 0.727989 0.932383 0.658593 0.555149 0.696275 0.659447 0.697047 0.643538 ... 0.185447 0.479602 0.344888 -0.133477 0.604301 0.289551 0.540549 0.637894 0.772562 0.279078
dpd 0.633123 1.000000 0.577858 0.678825 0.588176 0.456063 0.619756 0.595723 0.620055 0.568814 ... 0.176436 0.344869 0.324798 -0.116721 0.551852 0.153517 0.366910 0.531104 0.612811 0.170380
inpost 0.727989 0.577858 1.000000 0.726329 0.638070 0.538468 0.766055 0.761541 0.766040 0.688976 ... 0.216337 0.379784 0.264850 -0.140325 0.690957 0.198831 0.526453 0.663942 0.707653 0.267564
poczta 0.932383 0.678825 0.726329 1.000000 0.750532 0.594024 0.771962 0.719797 0.768171 0.736826 ... 0.228131 0.475660 0.380657 -0.153786 0.648610 0.273804 0.526729 0.653151 0.797082 0.322244
ruch 0.658593 0.588176 0.638070 0.750532 1.000000 0.553793 0.724067 0.656740 0.713889 0.725747 ... 0.191366 0.410148 0.390661 -0.092028 0.600039 0.136766 0.366073 0.578649 0.674947 0.172145
ups 0.555149 0.456063 0.538468 0.594024 0.553793 1.000000 0.559764 0.518229 0.560632 0.523815 ... 0.116203 0.275840 0.321317 -0.073834 0.472428 0.089415 0.353809 0.453653 0.524784 0.159706
tot 0.696275 0.619756 0.766055 0.771962 0.724067 0.559764 1.000000 0.957820 0.997314 0.931054 ... 0.259645 0.462499 0.369480 -0.139503 0.839171 0.156490 0.527990 0.726357 0.765618 0.299617
tot_0_14 0.659447 0.595723 0.761541 0.719797 0.656740 0.518229 0.957820 1.000000 0.963425 0.812264 ... 0.255449 0.407450 0.328193 -0.127525 0.835546 0.122008 0.491048 0.718564 0.720328 0.281477
tot_15_64 0.697047 0.620055 0.766040 0.768171 0.713889 0.560632 0.997314 0.963425 1.000000 0.904629 ... 0.257876 0.463317 0.363946 -0.139253 0.839450 0.155397 0.523403 0.732276 0.760263 0.305834
tot_65__ 0.643538 0.568814 0.688976 0.736826 0.725747 0.523815 0.931054 0.812264 0.904629 1.000000 ... 0.241371 0.445171 0.375790 -0.133190 0.752883 0.165183 0.511762 0.634516 0.732433 0.257589
tot_male 0.693253 0.617828 0.766841 0.767557 0.718936 0.558745 0.999442 0.962388 0.998496 0.921275 ... 0.259370 0.460812 0.365806 -0.139202 0.839265 0.154487 0.527636 0.726648 0.762084 0.304905
tot_fem 0.698219 0.620847 0.764730 0.775004 0.727761 0.560139 0.999604 0.953143 0.995456 0.938484 ... 0.259651 0.463520 0.372254 -0.139639 0.838366 0.158040 0.527830 0.725487 0.767931 0.294904
male_0_14 0.658514 0.594178 0.760773 0.718437 0.654441 0.519455 0.956871 0.999386 0.962584 0.810828 ... 0.253776 0.407276 0.328812 -0.126232 0.835176 0.123101 0.490262 0.718481 0.720046 0.281160
male_15_64 0.697335 0.619004 0.764175 0.769222 0.715041 0.561215 0.997117 0.961019 0.999729 0.906069 ... 0.258242 0.464219 0.365813 -0.138932 0.837742 0.157806 0.523447 0.730454 0.760063 0.309934
male_65__ 0.633027 0.569060 0.705071 0.723181 0.718361 0.523726 0.943703 0.837147 0.920806 0.991383 ... 0.242589 0.444004 0.360035 -0.137246 0.763051 0.150532 0.524690 0.641209 0.729029 0.270949
fem_0_14 0.659549 0.596544 0.761349 0.720263 0.658285 0.516242 0.957548 0.999322 0.963033 0.812698 ... 0.256878 0.407095 0.327115 -0.128797 0.834843 0.120720 0.491226 0.717715 0.719680 0.281424
fem_15_64 0.696470 0.620710 0.767354 0.766881 0.712532 0.559855 0.997034 0.965132 0.999784 0.902930 ... 0.257434 0.462300 0.362115 -0.139472 0.840590 0.153176 0.523127 0.733565 0.760096 0.302038
fem_65__ 0.644579 0.564343 0.674608 0.738927 0.724398 0.519870 0.916788 0.791976 0.888567 0.997247 ... 0.238835 0.442435 0.381838 -0.129845 0.741379 0.172239 0.500544 0.625891 0.728770 0.248043
fem_ratio 0.220452 0.188450 0.250751 0.231154 0.190278 0.160709 0.263373 0.260063 0.258972 0.253428 ... 0.115080 0.114745 0.135610 0.059570 0.254167 0.012718 0.199926 0.215878 0.251043 0.087491
buildings 0.185447 0.176436 0.216337 0.228131 0.191366 0.116203 0.259645 0.255449 0.257876 0.241371 ... 1.000000 0.102493 0.164972 -0.039460 0.221718 0.050823 0.164326 0.247324 0.289401 0.111615
shops 0.479602 0.344869 0.379784 0.475660 0.410148 0.275840 0.462499 0.407450 0.463317 0.445171 ... 0.102493 1.000000 0.207970 -0.089857 0.361356 0.099768 0.367379 0.337091 0.405009 0.293301
parks 0.344888 0.324798 0.264850 0.380657 0.390661 0.321317 0.369480 0.328193 0.363946 0.375790 ... 0.164972 0.207970 1.000000 0.167366 0.383377 0.183403 0.228138 0.296986 0.410353 0.106974
forests -0.133477 -0.116721 -0.140325 -0.153786 -0.092028 -0.073834 -0.139503 -0.127525 -0.139253 -0.133190 ... -0.039460 -0.089857 0.167366 1.000000 -0.093377 -0.075119 -0.112591 -0.046918 -0.123844 -0.136761
schools 0.604301 0.551852 0.690957 0.648610 0.600039 0.472428 0.839171 0.835546 0.839450 0.752883 ... 0.221718 0.361356 0.383377 -0.093377 1.000000 0.119677 0.500833 0.736080 0.682733 0.225839
railways 0.289551 0.153517 0.198831 0.273804 0.136766 0.089415 0.156490 0.122008 0.155397 0.165183 ... 0.050823 0.099768 0.183403 -0.075119 0.119677 1.000000 0.190559 0.179569 0.318710 0.155564
cycleways 0.540549 0.366910 0.526453 0.526729 0.366073 0.353809 0.527990 0.491048 0.523403 0.511762 ... 0.164326 0.367379 0.228138 -0.112591 0.500833 0.190559 1.000000 0.523555 0.607747 0.417191
parkings 0.637894 0.531104 0.663942 0.653151 0.578649 0.453653 0.726357 0.718564 0.732276 0.634516 ... 0.247324 0.337091 0.296986 -0.046918 0.736080 0.179569 0.523555 1.000000 0.690039 0.232029
crossings 0.772562 0.612811 0.707653 0.797082 0.674947 0.524784 0.765618 0.720328 0.760263 0.732433 ... 0.289401 0.405009 0.410353 -0.123844 0.682733 0.318710 0.607747 0.690039 1.000000 0.204487
bus_stops 0.279078 0.170380 0.267564 0.322244 0.172145 0.159706 0.299617 0.281477 0.305834 0.257589 ... 0.111615 0.293301 0.106974 -0.136761 0.225839 0.155564 0.417191 0.232029 0.204487 1.000000

29 rows × 29 columns

In [11]:
plt.rcParams["figure.figsize"] = (20, 16)
In [12]:
sns.heatmap(df[features].corr(), annot=True, cmap = 'Reds')
plt.show()

Division into three smaller sets

In [13]:
operators = ['dhl','dpd','inpost','poczta','ruch','ups']
demo = ['tot','tot_0_14','tot_15_64','tot_65__','tot_male','tot_fem','male_0_14','male_15_64','male_65__','fem_0_14','fem_15_64','fem_65__','fem_ratio']
points = ['buildings','shops','parks','forests','schools','railways','cycleways','parkings','crossings']

The correlation between DHL and Poczta Polska is disturbing - as much as 93%! In the next stages of the study, we will decide which variable should possibly be removed during modeling. It should also be noted the relatively high correlation between Poczta Polska and Ruch and between Inpost and DHL.

In [14]:
plt.rcParams["figure.figsize"] = (15, 8)
sns.heatmap(df[operators].corr(), annot=True, cmap = 'Reds')
plt.show()

The correlation between the demographic variables is extremely high. Only the variable that determines the "female ratio" is not correlated with the other variables. Most of these variables should be removed in the subsequent phases of the study.

In [15]:
plt.rcParams["figure.figsize"] = (15, 8)
sns.heatmap(df[demo].corr(), annot=True, cmap = 'Reds')
plt.show()

The last correlation plot is for the points of interest variables. The highest correlation between the variables is for "parkings" and "schools" ~ 74%.

In [16]:
plt.rcParams["figure.figsize"] = (15, 8)
sns.heatmap(df[points].corr(), annot=True, cmap = 'Reds')
plt.show()

Transformations of variables

The function below checks if a variable should be binarized (according to the rule, when the most common value for a variable is over 33% of all observations). If a variable is binarized, the new variable is visualized and the new variable is appended to the entire dataset.

In [17]:
def visualisation_binary(df, var):
    new_name = var + '_binary'
    if(df[var].value_counts().iloc[0] > int(df.shape[0]*0.33)):
        print('\nnew variable: "' + new_name + '"')
        pyk = df[var].value_counts().index[0]
        df[new_name] = df[var].apply(lambda x: 'few_'+var if x == 0  else 'more_'+var)
        plt.figure(figsize=(10, 5))
        df[new_name].value_counts().plot(kind='bar')
        plt.plot()
        plt.show()
    else: print(f'\nno binary transformation for variable "{var}"')
In [18]:
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [10, 5]})
for point in points:
    visualisation_binary(df, point)
new variable: "buildings_binary"
new variable: "shops_binary"
new variable: "parks_binary"
no binary transformation for variable "forests"

new variable: "schools_binary"
new variable: "railways_binary"
new variable: "cycleways_binary"
no binary transformation for variable "parkings"

no binary transformation for variable "crossings"

Then we log the variables for demographic features (due to the fact that they are continuous and their distributions have a long tail). The left-hand side shows the distributions before the transformation, while the right-hand side shows the variables after the transformation.

In [19]:
def visualisation_log(df, var):
    new_name = var + 'log'

    df[new_name] = np.log(1+df[var])

    fig, ax = plt.subplots(1,2)
    sns.histplot(x = df[var], kde=False, edgecolor= "black", color = 'indianred', linewidth= 1.5, ax=ax[0]).set_title(f"{var}", fontsize=20)
    sns.histplot(x = df[new_name], kde=False, edgecolor= "black", color = 'lightgreen', linewidth= 1.5, ax=ax[1]).set_title(f"Logarithm of {var}", fontsize=20)
    plt.plot()
    plt.show()
In [20]:
demo_logarithms = ['tot','tot_0_14','tot_15_64','tot_65__','tot_male','tot_fem','male_0_14','male_15_64','male_65__','fem_0_14','fem_15_64','fem_65__']
In [21]:
sns.set(style='whitegrid', palette="deep", font_scale=1.1, rc={"figure.figsize": [20, 5]})
for point in demo_logarithms:
    visualisation_log(df, point)
In [23]:
df.to_csv("../datasets/preprocessed_data/df_after_transform.csv", index=False)

Transformations for GWR

Above transformation will also be applied for the GWR datasets.

In [22]:
df_warszawa = pd.read_csv("../datasets/preprocessed_data/df_warszawa_gwr.csv")
df_krakow = pd.read_csv("../datasets/preprocessed_data/df_krakow_gwr.csv")
df_warszawa['city'] = 'Warszawa'
df_krakow['city'] = 'Krakow'

df_gwr = pd.concat([df_warszawa, df_krakow])
df_gwr = df_gwr.reset_index()
df_gwr = df_gwr.drop(columns = {'index'})
In [23]:
demo_logarithms_gwr = ['TOT','TOT_0_14','TOT_15_64','TOT_65__','TOT_MALE','TOT_FEM','MALE_0_14','MALE_15_64','MALE_65__','FEM_0_14','FEM_15_64','FEM_65__']
In [24]:
for point in points:
    visualisation_binary(df_gwr, point)
for point in demo_logarithms_gwr:
    visualisation_log(df_gwr, point)
no binary transformation for variable "buildings"

new variable: "shops_binary"
no binary transformation for variable "parks"

no binary transformation for variable "forests"

no binary transformation for variable "schools"

new variable: "railways_binary"
no binary transformation for variable "cycleways"

no binary transformation for variable "parkings"

no binary transformation for variable "crossings"
In [ ]:
df_gwr['geometry'] = df_gwr.apply(lambda x: shapely.wkt.loads(x['center']), axis=1)
df_gwr = gpd.GeoDataFrame(df_gwr, geometry = 'geometry', crs = "epsg:4258")
df_gwr.to_file("../datasets/preprocessed_data/df_gwr_after_transform.shp")
df_gwr[df_gwr.city == 'Warszawa'].to_file("../datasets/preprocessed_data/df_warszawa_for_gwr.shp")
df_gwr[df_gwr.city == 'Krakow'].to_file("../datasets/preprocessed_data/df_krakow_for_gwr.shp")

7. Spatial dependence models

Warning! This chapter is realized in R

In this chapter we create spatial dependence econometric models to answer our research hypothesis. We believe based on spatial autocorrelation tests (performed in previous chapters) that spatial autocorrelation exists in the analyzed phenomenon. We create models for Warsaw and Cracow separately. We utilize here 1km2 grid!

Generally, we specific following functional form of the models: $inpost_i = \alpha * \text{logistic_companies}_i + \beta * \text{demographic_variables}_i + \gamma * \text{point_of_interest_variables}_i + \theta * \text{spatial_terms}_i + \epsilon$,

where:

  • logistic_companies cover: DHL, DPD, Fedex, Poczta Polska, Ruch and UPS
  • demographic_variables cover: total density, density age 0-14, density age 15-64, density age 65-..., total mail density, total female density, female density age 0-14, female density age 15-64, female density age 65-..., male density age 0-14, male density age 15-64, male density age 65-..., feminization ratio
  • point_of_interest_variables cover: buildings, shops, parks, forests, schools, railways, cycleways, parkings, crossings, bus stops

Obviously, a significant proportion of these variables will be removed from the final model due to high co-linearity or statistical insignificance, however, based on the literature and our intuition, we assume that at least 1-2 variables from each category will ultimately be significant.

Our expectations for the results are as follows (taking into account the research hypotheses). Based on the visual analysis of spatial data carried out in the previous chapters of this work, it seems to us that InPost has deployed parcel pick-up points in accordance with the competition. Therefore, we expect the variables in the logistic_companies category to be significant and have a positive sign. In addition, it seems to us that the control variables (groups: demographic_variables, point_of_interest_variables) should also be important in the models and their sign should follow the logic: the more shops in the grid, the more parcel machines, the greater the afforestation, the fewer parcel machines, the more bus stops the more parcel machines, the higher the population density, the more parcel machines etc. In addition, we believe that extending the OLS model with spatial components is a good idea and models that take into account spillover effects will be econometrically better. We apply these expectations to both Warsaw and Cracow.

We adopted the following modeling strategy due to the complexity of the problem:

  1. OLS model estimation to establish an initial functional form that is free of collinearity (VIF statistic) of the variables along with model diagnostics.
  2. Estimation of the final OLS model using Stepwise Regression procedure (both directions).
  3. Estimation of spatial models with 3, 2, 1 spatial components (using obtimate functional form from point 2).
  4. Comparing the statistical properties of the models from step 3 and selecting the target model. We look for model with significant spatial components, highest number of significant variables and minimum AIC (we also take into account BIC).
  5. For the obtimate model in terms of spatial components, we test other combinations of variables using the general to specific (LSE) procedure.
  6. Interpret the results of the model from point 5

We covered all possible options to improve the models:

  • Changing spatial components (done - all possible components tried)
  • Changing the spatial weights matrix (done - QUEEN and ROOK tested)
  • Changing variables in the model (done - general to specific approach to make a choice for OLS and then infer based on it)

We set 10% significance level!

Import dependencies

In [72]:
library(rgdal)
library(spdep)
library(tidyverse)
library(car)
library(lmtest)
library(arules)
library(spatialreg)
library(texreg)
library(stargazer)

Models for Warsaw

Data loading, spatial transformations and additional feature engineering

In [2]:
df <- readOGR("../datasets/preprocessed_data/df_warszawa.shp")
df_data <- read.csv("../datasets/preprocessed_data/df_warszawa.csv")
df <- spTransform(df, CRS("+proj=longlat +ellps=GRS80 +no_defs"))
df.cont.nb <- poly2nb(as(df, "SpatialPolygons"), queen=TRUE) #QUEEN but ROOK was tried!
df.cont.listw <- nb2listw(df.cont.nb, style="W")
OGR data source with driver: ESRI Shapefile 
Source: "D:\spatial_econometric_project\datasets\preprocessed_data\df_warszawa.shp", layer: "df_warszawa"
with 601 features
It has 31 fields
Integer64 fields read as strings:  grid_index 
In [73]:
cat(colnames(df_data))
grid_index geometry dhl dpd fedex inpost poczta ruch ups tot tot_0_14 tot_15_64 tot_65__ tot_male tot_fem male_0_14 male_15_64 male_65__ fem_0_14 fem_15_64 fem_65__ fem_ratio buildings shops parks forests schools railways cycleways parkings crossings bus_stops

We investigate again correlation between logistic companies. We can clearly see that Poczta Polska is highly collinear with other variables so we will remove it from the further analysis. The rest variables seems to be ok.

In [3]:
stats::cor(df_data %>% select('dhl', 'dpd', 'fedex', 'inpost', 'poczta', 'ruch', 'ups'))
dhldpdfedexinpostpocztaruchups
dhl1.0000000 0.6277199 0.183632650.7380562 0.9238711 0.6356515 0.53454845
dpd0.6277199 1.0000000 0.135327480.5657193 0.6875670 0.5941321 0.43918066
fedex0.1836326 0.1353275 1.000000000.2048367 0.1740457 0.1312876 0.08168199
inpost0.7380562 0.5657193 0.204836651.0000000 0.7411384 0.6494003 0.53922648
poczta0.9238711 0.6875670 0.174045730.7411384 1.0000000 0.7454525 0.57380202
ruch0.6356515 0.5941321 0.131287610.6494003 0.7454525 1.0000000 0.53308523
ups0.5345484 0.4391807 0.081681990.5392265 0.5738020 0.5330852 1.00000000

Potential enhancement of the features.

In [4]:
df$tot_0_14 <- df$tot_0_14/df$tot
df$tot_15_64 <- df$tot_15_64/df$tot
df$tot_65__ <- df$tot_65__/df$tot

df$male_0_14 <- df$male_0_14/df$tot_male 
df$male_15_64 <- df$male_15_64/df$tot_male 
df$male_65__ <- df$male_65__/df$tot_male 

df$fem_0_14 <- df$fem_0_14/df$tot_fem 
df$fem_15_64 <- df$fem_15_64/df$tot_fem 
df$fem_65__ <- df$fem_65__/df$tot_fem 

OLS models

Based on many tries we decided that below formula is the most appropriate general formula for the given problem. We removed Poczta Polska from logistic companies, we left total density and feminization ratio in case of demographic data and finally we leave all data regarding points of interest.

In [5]:
formula = inpost ~ dhl + dpd + fedex + ruch + ups +
            log(tot + 1) + fem_ratio + 
            buildings + shops + parks + forests + schools + 
            railways + cycleways + parkings + crossings + bus_stops

We estimate OLS model and we perform its diagnostics. We see that nearly 50% of the variables are significant (2 or more per each category of data). We do not have now high collinearity. However this model did not pass Ramsey and Breusch–Pagan test so the functional form is wrong and we have got heteroscedasticity in the errors. It occurred that at the level of 10% significance our residuals are spatially dependent!

In [6]:
ols <- lm(formula, data=df)
summary(ols)
car::vif(ols)
bptest(ols) 
resettest(ols, power=2, type="regressor")
lm.morantest(ols, df.cont.listw) 
Call:
lm(formula = formula, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6926 -0.5928 -0.1367  0.3983  5.4677 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.095e-01  1.768e-01  -0.619 0.536066    
dhl           2.425e-01  3.221e-02   7.529 1.95e-13 ***
dpd           4.465e-02  7.660e-02   0.583 0.560192    
fedex         4.634e-01  3.063e-01   1.513 0.130902    
ruch          3.813e-01  8.948e-02   4.262 2.37e-05 ***
ups           3.945e-01  1.120e-01   3.521 0.000463 ***
log(tot + 1)  1.909e-01  3.953e-02   4.830 1.75e-06 ***
fem_ratio    -4.774e-03  2.312e-03  -2.065 0.039352 *  
buildings    -5.622e-05  9.314e-04  -0.060 0.951885    
shops        -3.495e-03  9.629e-02  -0.036 0.971055    
parks        -5.541e-02  1.443e-02  -3.839 0.000137 ***
forests      -6.320e-03  5.571e-03  -1.134 0.257150    
schools       7.159e-02  1.838e-02   3.895 0.000110 ***
railways      4.179e-03  3.056e-03   1.368 0.171990    
cycleways    -4.947e-04  5.660e-03  -0.087 0.930380    
parkings      1.420e-03  3.715e-03   0.382 0.702542    
crossings     3.810e-03  2.529e-03   1.507 0.132412    
bus_stops     3.841e-01  3.343e-01   1.149 0.251004    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.307 on 583 degrees of freedom
Multiple R-squared:  0.6858,	Adjusted R-squared:  0.6767 
F-statistic: 74.87 on 17 and 583 DF,  p-value: < 2.2e-16
dhl
3.43074019528456
dpd
1.95978922947825
fedex
1.0698836897454
ruch
2.6144021482233
ups
1.66193714988798
log(tot + 1)
3.75608916762092
fem_ratio
2.22705414708662
buildings
1.1403314263769
shops
1.76526175900298
parks
1.4312130818324
forests
1.16414512840716
schools
3.2249545670834
railways
1.24378280865441
cycleways
1.80434409053292
parkings
3.54942158012825
crossings
4.52584196399078
bus_stops
1.03639735191893
	studentized Breusch-Pagan test

data:  ols
BP = 141.3, df = 17, p-value < 2.2e-16
	RESET test

data:  ols
RESET = 5.2474, df1 = 17, df2 = 566, p-value = 7.66e-11
	Global Moran I for regression residuals

data:  
model: lm(formula = formula, data = df)
weights: df.cont.listw

Moran I statistic standard deviate = 1.6831, p-value = 0.04618
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
    0.0291485488    -0.0063435194     0.0004446764 

Then we applied Stepwise Regression procedure to obtain the most significant model starting from the function form defined above.

In [7]:
intercept_only <- lm(inpost~1, data=df)
all <- lm(formula, data=df)
both <- step(intercept_only, direction='both', scope=formula(all), trace=0)
both
Call:
lm(formula = inpost ~ dhl + schools + log(tot + 1) + ruch + parks + 
    ups + crossings + fem_ratio + fedex, data = df)

Coefficients:
 (Intercept)           dhl       schools  log(tot + 1)          ruch  
   -0.144470      0.258789      0.075243      0.192578      0.379413  
       parks           ups     crossings     fem_ratio         fedex  
   -0.058957      0.379756      0.004715     -0.004973      0.482499  

Based on Stepwise Regression we obtained final formula of the model and we are able to run our final OLS with diagnostic part. We see that nearly all variables in the model are significant (of course due to the heteroscedasticity we use biased error variance estimator - we omitted application of the robust matrix due to lack of time). There is no collinearity in the model, but again we did not pass Ramsey and B-P tests. What is more residuals from the model are still spatially dependent. In case of interpretation we can see that: logistic companies have positive sign as expected; the greater the total density the more InPost pick-up points are stated; fem_ratio results is interesting because its sign is negative (more woman less pick-up points, quite spurious?!), in addition, the negative relationship between parks and pickup points is interesting but maybe intuitive.

In [8]:
formula = inpost ~ dhl + schools + log(tot + 1) + ruch + parks + 
    ups + crossings + fem_ratio + fedex
ols <- lm(formula, data=df)
summary(ols)
cat("AIC", AIC(ols))
car::vif(ols)
bptest(ols) 
resettest(ols, power=2, type="regressor")
lm.morantest(ols, df.cont.listw) 
Call:
lm(formula = formula, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9841 -0.6204 -0.1485  0.3748  5.7877 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.144470   0.171182  -0.844 0.399034    
dhl           0.258789   0.028132   9.199  < 2e-16 ***
schools       0.075243   0.015650   4.808 1.94e-06 ***
log(tot + 1)  0.192578   0.039205   4.912 1.17e-06 ***
ruch          0.379413   0.081565   4.652 4.07e-06 ***
parks        -0.058957   0.013272  -4.442 1.06e-05 ***
ups           0.379756   0.108958   3.485 0.000528 ***
crossings     0.004715   0.002238   2.106 0.035612 *  
fem_ratio    -0.004973   0.002298  -2.164 0.030876 *  
fedex         0.482499   0.303603   1.589 0.112541    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.303 on 591 degrees of freedom
Multiple R-squared:  0.683,	Adjusted R-squared:  0.6782 
F-statistic: 141.5 on 9 and 591 DF,  p-value: < 2.2e-16
AIC 2035.98
dhl
2.63025185365071
schools
2.34901560332611
log(tot + 1)
3.71236593931246
ruch
2.1828715504893
parks
1.21605920953507
ups
1.57969063002112
crossings
3.56330324631023
fem_ratio
2.21206367597018
fedex
1.05589585603365
	studentized Breusch-Pagan test

data:  ols
BP = 133.08, df = 9, p-value < 2.2e-16
	RESET test

data:  ols
RESET = 6.8392, df1 = 9, df2 = 582, p-value = 2.223e-09
	Global Moran I for regression residuals

data:  
model: lm(formula = formula, data = df)
weights: df.cont.listw

Moran I statistic standard deviate = 1.4688, p-value = 0.07094
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
    0.0268763136    -0.0042751978     0.0004497972 

Spatial models with 3, 2, or 1 spatial components

Now we moved to the estimation of the spatial models. In the first step we estimate all possible models due to spatial terms and then we analyze which model is reasonable (based on number of significant variables, significant spatial terms and AIC+BIC).

In [9]:
GNS <- sacsarlm(formula, data=df, listw=df.cont.listw, type="sacmixed", method="LU")
summary(GNS)
Call:sacsarlm(formula = formula, data = df, listw = df.cont.listw, 
    type = "sacmixed", method = "LU")

Residuals:
     Min       1Q   Median       3Q      Max 
-4.65144 -0.59387 -0.10078  0.39765  5.29613 

Type: sacmixed 
Coefficients: (numerical Hessian approximate standard errors) 
                   Estimate Std. Error z value  Pr(>|z|)
(Intercept)      -0.3096406  0.3759817 -0.8236  0.410194
dhl               0.2228032  0.0321699  6.9258 4.335e-12
schools           0.0805577  0.0174675  4.6119 3.991e-06
log(tot + 1)      0.1356923  0.0439068  3.0905  0.001998
ruch              0.4126336  0.0829722  4.9732 6.587e-07
parks            -0.0388209  0.0159319 -2.4367  0.014823
ups               0.3522209  0.1076515  3.2719  0.001068
crossings         0.0067908  0.0026043  2.6076  0.009119
fem_ratio        -0.0038325  0.0023313 -1.6440  0.100181
fedex             0.6562935  0.3025839  2.1690  0.030085
lag.dhl           0.1202994  0.1140963  1.0544  0.291715
lag.schools      -0.0409694  0.0355450 -1.1526  0.249072
lag.log(tot + 1)  0.1711298  0.1043324  1.6402  0.100956
lag.ruch         -0.5786937  0.2098404 -2.7578  0.005820
lag.parks        -0.0495603  0.0343919 -1.4410  0.149572
lag.ups          -0.1687182  0.2763260 -0.6106  0.541480
lag.crossings    -0.0065111  0.0046852 -1.3897  0.164615
lag.fem_ratio    -0.0063947  0.0054941 -1.1639  0.244458
lag.fedex        -0.1503513  0.9066266 -0.1658  0.868286

Rho: 0.29692
Approximate (numerical Hessian) standard error: 0.23463
    z-value: 1.2654, p-value: 0.20571
Lambda: -0.30141
Approximate (numerical Hessian) standard error: 0.30838
    z-value: -0.97738, p-value: 0.32838

LR test value: 27.493, p-value: 0.0038685

Log likelihood: -993.2434 for sacmixed model
ML residual variance (sigma squared): 1.5565, (sigma: 1.2476)
Number of observations: 601 
Number of parameters estimated: 22 
AIC: 2030.5, (AIC for lm: 2036)
In [10]:
SAC <- sacsarlm(formula, data=df, listw=df.cont.listw)
summary(SAC)
Call:sacsarlm(formula = formula, data = df, listw = df.cont.listw)

Residuals:
     Min       1Q   Median       3Q      Max 
-4.77462 -0.60148 -0.15324  0.41148  5.76569 

Type: sac 
Coefficients: (asymptotic standard errors) 
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)  -0.1721734  0.1729029 -0.9958 0.3193566
dhl           0.2477643  0.0287298  8.6239 < 2.2e-16
schools       0.0723287  0.0158275  4.5698 4.881e-06
log(tot + 1)  0.1809632  0.0391950  4.6170 3.893e-06
ruch          0.3864329  0.0807050  4.7882 1.683e-06
parks        -0.0596580  0.0133850 -4.4571 8.308e-06
ups           0.3753236  0.1079086  3.4782 0.0005049
crossings     0.0041734  0.0023173  1.8010 0.0717103
fem_ratio    -0.0047245  0.0022738 -2.0778 0.0377256
fedex         0.4830808  0.2999343  1.6106 0.1072621

Rho: 0.074328
Asymptotic standard error: 0.064193
    z-value: 1.1579, p-value: 0.24691
Lambda: 0.024735
Asymptotic standard error: 0.10089
    z-value: 0.24518, p-value: 0.80632

LR test value: 2.8919, p-value: 0.23552

Log likelihood: -1005.544 for sac model
ML residual variance (sigma squared): 1.6611, (sigma: 1.2889)
Number of observations: 601 
Number of parameters estimated: 13 
AIC: 2037.1, (AIC for lm: 2036)
In [11]:
SDM <- errorsarlm(formula, data=df, listw=df.cont.listw, etype="mixed")
summary(SDM)
Call:errorsarlm(formula = formula, data = df, listw = df.cont.listw, 
    etype = "mixed")

Residuals:
      Min        1Q    Median        3Q       Max 
-4.757531 -0.597547 -0.095155  0.389789  5.263433 

Type: error 
Coefficients: (asymptotic standard errors) 
                   Estimate Std. Error z value  Pr(>|z|)
(Intercept)      -0.4234932  0.4373072 -0.9684  0.332839
dhl               0.2223463  0.0310939  7.1508 8.629e-13
schools           0.0800192  0.0169938  4.7087 2.493e-06
log(tot + 1)      0.1461070  0.0423166  3.4527  0.000555
ruch              0.3927065  0.0797844  4.9221 8.562e-07
parks            -0.0423316  0.0153067 -2.7656  0.005683
ups               0.3401743  0.1063629  3.1982  0.001383
crossings         0.0068412  0.0025220  2.7126  0.006676
fem_ratio        -0.0043878  0.0022628 -1.9391  0.052495
fedex             0.6458184  0.3077289  2.0987  0.035847
lag.dhl           0.2297732  0.0727347  3.1591  0.001583
lag.schools      -0.0208795  0.0358180 -0.5829  0.559939
lag.log(tot + 1)  0.2490517  0.0863627  2.8838  0.003929
lag.ruch         -0.4778735  0.2308265 -2.0703  0.038427
lag.parks        -0.0753979  0.0289574 -2.6038  0.009221
lag.ups          -0.0353482  0.2830698 -0.1249  0.900623
lag.crossings    -0.0068383  0.0051760 -1.3212  0.186450
lag.fem_ratio    -0.0081216  0.0058392 -1.3909  0.164264
lag.fedex         0.1775282  0.9899517  0.1793  0.857678

Lambda: 0.042353, LR test value: 0.28328, p-value: 0.59456
Asymptotic standard error: 0.076145
    z-value: 0.55622, p-value: 0.57806
Wald statistic: 0.30938, p-value: 0.57806

Log likelihood: -993.5147 for error model
ML residual variance (sigma squared): 1.5969, (sigma: 1.2637)
Number of observations: 601 
Number of parameters estimated: 21 
AIC: 2029, (AIC for lm: 2027.3)
In [12]:
SDEM <- errorsarlm(formula, data=df, listw=df.cont.listw, etype="emixed")
summary(SDEM)
Call:errorsarlm(formula = formula, data = df, listw = df.cont.listw, 
    etype = "emixed")

Residuals:
      Min        1Q    Median        3Q       Max 
-4.757531 -0.597547 -0.095155  0.389789  5.263433 

Type: error 
Coefficients: (asymptotic standard errors) 
                   Estimate Std. Error z value  Pr(>|z|)
(Intercept)      -0.4234932  0.4373072 -0.9684  0.332839
dhl               0.2223463  0.0310939  7.1508 8.629e-13
schools           0.0800192  0.0169938  4.7087 2.493e-06
log(tot + 1)      0.1461070  0.0423166  3.4527  0.000555
ruch              0.3927065  0.0797844  4.9221 8.562e-07
parks            -0.0423316  0.0153067 -2.7656  0.005683
ups               0.3401743  0.1063629  3.1982  0.001383
crossings         0.0068412  0.0025220  2.7126  0.006676
fem_ratio        -0.0043878  0.0022628 -1.9391  0.052495
fedex             0.6458184  0.3077289  2.0987  0.035847
lag.dhl           0.2297732  0.0727347  3.1591  0.001583
lag.schools      -0.0208795  0.0358180 -0.5829  0.559939
lag.log(tot + 1)  0.2490517  0.0863627  2.8838  0.003929
lag.ruch         -0.4778735  0.2308265 -2.0703  0.038427
lag.parks        -0.0753979  0.0289574 -2.6038  0.009221
lag.ups          -0.0353482  0.2830698 -0.1249  0.900623
lag.crossings    -0.0068383  0.0051760 -1.3212  0.186450
lag.fem_ratio    -0.0081216  0.0058392 -1.3909  0.164264
lag.fedex         0.1775282  0.9899517  0.1793  0.857678

Lambda: 0.042353, LR test value: 0.28328, p-value: 0.59456
Asymptotic standard error: 0.076145
    z-value: 0.55622, p-value: 0.57806
Wald statistic: 0.30938, p-value: 0.57806

Log likelihood: -993.5147 for error model
ML residual variance (sigma squared): 1.5969, (sigma: 1.2637)
Number of observations: 601 
Number of parameters estimated: 21 
AIC: 2029, (AIC for lm: 2027.3)
In [13]:
SEM <- errorsarlm(formula, data=df, listw=df.cont.listw)
summary(SEM)
Call:errorsarlm(formula = formula, data = df, listw = df.cont.listw)

Residuals:
     Min       1Q   Median       3Q      Max 
-4.81062 -0.60202 -0.18188  0.39096  5.84370 

Type: error 
Coefficients: (asymptotic standard errors) 
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)  -0.1301997  0.1734180 -0.7508  0.452782
dhl           0.2530099  0.0283798  8.9151 < 2.2e-16
schools       0.0758115  0.0158034  4.7972 1.609e-06
log(tot + 1)  0.1823923  0.0394561  4.6227 3.788e-06
ruch          0.3885116  0.0808737  4.8039 1.556e-06
parks        -0.0560362  0.0134822 -4.1563 3.234e-05
ups           0.3819184  0.1078447  3.5414  0.000398
crossings     0.0050275  0.0022632  2.2214  0.026325
fem_ratio    -0.0046251  0.0022768 -2.0314  0.042218
fedex         0.4883269  0.2992225  1.6320  0.102682

Lambda: 0.097709, LR test value: 1.5928, p-value: 0.20693
Asymptotic standard error: 0.074295
    z-value: 1.3152, p-value: 0.18846
Wald statistic: 1.7296, p-value: 0.18846

Log likelihood: -1006.194 for error model
ML residual variance (sigma squared): 1.6639, (sigma: 1.2899)
Number of observations: 601 
Number of parameters estimated: 12 
AIC: 2036.4, (AIC for lm: 2036)
In [20]:
SAR <- lagsarlm(formula, data=df, listw=df.cont.listw)
summary(SAR)
Call:lagsarlm(formula = formula, data = df, listw = df.cont.listw)

Residuals:
     Min       1Q   Median       3Q      Max 
-4.79287 -0.60136 -0.15215  0.41224  5.74634 

Type: lag 
Coefficients: (asymptotic standard errors) 
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)  -0.1786277  0.1706989 -1.0464 0.2953537
dhl           0.2478015  0.0282846  8.7610 < 2.2e-16
schools       0.0717215  0.0156378  4.5864 4.509e-06
log(tot + 1)  0.1821701  0.0389206  4.6806 2.861e-06
ruch          0.3845874  0.0806541  4.7684 1.857e-06
parks        -0.0605122  0.0132188 -4.5778 4.700e-06
ups           0.3736604  0.1078399  3.4650 0.0005303
crossings     0.0040075  0.0022673  1.7675 0.0771456
fem_ratio    -0.0047848  0.0022728 -2.1052 0.0352740
fedex         0.4814659  0.3002205  1.6037 0.1087785

Rho: 0.084055, LR test value: 2.8325, p-value: 0.092373
Asymptotic standard error: 0.048482
    z-value: 1.7337, p-value: 0.082964
Wald statistic: 3.0059, p-value: 0.082964

Log likelihood: -1005.574 for lag model
ML residual variance (sigma squared): 1.6611, (sigma: 1.2888)
Number of observations: 601 
Number of parameters estimated: 12 
AIC: 2035.1, (AIC for lm: 2036)
LM test for residual autocorrelation
test value: 0.059256, p-value: 0.80768
In [15]:
SLX <- lmSLX(formula, data=df, listw=df.cont.listw)
summary(SLX)
Call:
lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
    data = as.data.frame(x), weights = weights)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8073 -0.5970 -0.0874  0.3925  5.2600 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -0.419836   0.434336  -0.967 0.334137    
dhl               0.223368   0.031732   7.039 5.47e-12 ***
schools           0.080110   0.017349   4.618 4.78e-06 ***
log.tot...1.      0.146202   0.043149   3.388 0.000751 ***
ruch              0.392732   0.081145   4.840 1.67e-06 ***
parks            -0.041995   0.015633  -2.686 0.007431 ** 
ups               0.340498   0.108101   3.150 0.001717 ** 
crossings         0.006810   0.002574   2.645 0.008378 ** 
fem_ratio        -0.004387   0.002300  -1.907 0.056960 .  
fedex             0.647918   0.311972   2.077 0.038254 *  
lag.dhl           0.226797   0.073165   3.100 0.002030 ** 
lag.schools      -0.021032   0.035888  -0.586 0.558083    
lag.log.tot...1.  0.252093   0.086542   2.913 0.003718 ** 
lag.ruch         -0.496170   0.230863  -2.149 0.032031 *  
lag.parks        -0.075354   0.029070  -2.592 0.009777 ** 
lag.ups          -0.039089   0.283833  -0.138 0.890509    
lag.crossings    -0.006459   0.005184  -1.246 0.213252    
lag.fem_ratio    -0.008335   0.005853  -1.424 0.154978    
lag.fedex         0.135883   0.995723   0.136 0.891500    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.285 on 582 degrees of freedom
Multiple R-squared:  0.6968,	Adjusted R-squared:  0.6874 
F-statistic: 74.31 on 18 and 582 DF,  p-value: < 2.2e-16

Based on the above outputs we can clearly state that the best spatial model is SAR (Spatial Lag Model) - significant spatial terms and many significant variables.

Then we compare models BIC and AIC. We can clearly see that the best model in comparison to the OLS is SAR, however its BIC is greater that OLS BIC. In case of AIC which is more important the best model is SLX. However based on its output we can claim that SAR is still better in case of econometric inference.

In [30]:
cat("ols",BIC(ols) ,'\n')
cat("GNS",BIC(GNS),'\n')
cat("SAC",BIC(SAC),'\n')
cat("SDM",BIC(SDM),'\n')
cat("SDEM",BIC(SDEM),'\n')
cat("SAR",BIC(SAR),'\n')
cat("SLX",BIC(SLX),'\n')
cat("SEM",BIC(SEM),'\n')
cat('\n')
cat("ols",AIC(ols) ,'\n')
cat("GNS",AIC(GNS),'\n')
cat("SAC",AIC(SAC),'\n')
cat("SDM",AIC(SDM),'\n')
cat("SDEM",AIC(SDEM),'\n')
cat("SAR",AIC(SAR),'\n')
cat("SLX",AIC(SLX),'\n')
cat("SEM",AIC(SEM),'\n')
ols 2084.365 
GNS 2127.256 
SAC 2094.27 
SDM 2121.4 
SDEM 2121.4 
SAR 2087.931 
SLX 2115.285 
SEM 2089.17 

ols 2035.98 
GNS 2030.487 
SAC 2037.088 
SDM 2029.029 
SDEM 2029.029 
SAR 2035.147 
SLX 2027.313 
SEM 2036.387 

Now we wanted to find better functional form for the SAR model in the general to specific procedure (LSA) done by hand, however we were not able to improve this model (our previous functional form was the best one). Residuals in the model are not autocorrelated.

In [27]:
formula_max = inpost ~ dhl + dpd + fedex + ruch + ups +
            log(tot + 1) + fem_ratio + 
            buildings + shops + parks + forests + schools + 
            railways + cycleways + parkings + crossings + bus_stops

formula_gen_to_specifc = inpost ~ dhl + fedex + ruch + ups +
            log(tot + 1) + fem_ratio + 
            parks + schools + 
            crossings

SAR_g2s<- lagsarlm(formula_gen_to_specifc, data=df, listw=df.cont.listw)
summary(SAR_g2s) #the same as previously!
Call:
lagsarlm(formula = formula_gen_to_specifc, data = df, listw = df.cont.listw)

Residuals:
     Min       1Q   Median       3Q      Max 
-4.79287 -0.60136 -0.15215  0.41224  5.74634 

Type: lag 
Coefficients: (asymptotic standard errors) 
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)  -0.1786277  0.1706989 -1.0464 0.2953537
dhl           0.2478015  0.0282846  8.7610 < 2.2e-16
fedex         0.4814659  0.3002205  1.6037 0.1087785
ruch          0.3845874  0.0806541  4.7684 1.857e-06
ups           0.3736604  0.1078399  3.4650 0.0005303
log(tot + 1)  0.1821701  0.0389206  4.6806 2.861e-06
fem_ratio    -0.0047848  0.0022728 -2.1052 0.0352740
parks        -0.0605122  0.0132188 -4.5778 4.700e-06
schools       0.0717215  0.0156378  4.5864 4.509e-06
crossings     0.0040075  0.0022673  1.7675 0.0771456

Rho: 0.084055, LR test value: 2.8325, p-value: 0.092373
Asymptotic standard error: 0.048482
    z-value: 1.7337, p-value: 0.082964
Wald statistic: 3.0059, p-value: 0.082964

Log likelihood: -1005.574 for lag model
ML residual variance (sigma squared): 1.6611, (sigma: 1.2888)
Number of observations: 601 
Number of parameters estimated: 12 
AIC: 2035.1, (AIC for lm: 2036)
LM test for residual autocorrelation
test value: 0.059256, p-value: 0.80768

Direct and indirect impacts

Now we calculate direct and indirect impacts of the model. And we finally analyze its output.

Let's start with logistic companies variables. We see that either in case of direct and indirect impacts all parameters for logistic companies (apart indirect Fedex) are significant (at least around 10%). Internalization effects are greater than spill-over, but still spill-over effects are significant. In case of demographic variables we claim that total population density is important variable for the model in case of direct and indirect impact (has positive sign). Feminization ratio in case of direct impact is also significant and has negative sign (it is quite non intuitive). In case of point of interest data we see that number of schools, number of parks and number of crossings are significant in the model (crossings spillover effect is not relevant) with respectively positive, negative and positive sign. We claim that it might be intuitive.

In [31]:
# distribution of total impact 
# vector of traces of powers of a spatial weights matrix
# converting censored listw to CsparseMatrix form
W.c <- as(as_dgRMatrix_listw(df.cont.listw), "CsparseMatrix") 
# the default values for the number of powers is 30
trMat<-trW(W.c, type="mult") 

SAR_imp<-impacts(SAR, tr=trMat, R=200)
summary(SAR_imp, zstats=TRUE, short=TRUE)
Impact measures (lag, trace):
                   Direct      Indirect        Total
dhl           0.248049502  0.0224925064  0.270542008
schools       0.071793286  0.0065100350  0.078303321
log(tot + 1)  0.182352374  0.0165352557  0.198887629
ruch          0.384972296  0.0349083218  0.419880618
parks        -0.060572761 -0.0054925860 -0.066065347
ups           0.374034348  0.0339164961  0.407950844
crossings     0.004011499  0.0003637527  0.004375252
fem_ratio    -0.004789587 -0.0004343077 -0.005223894
fedex         0.481947745  0.0437018122  0.525649557
========================================================
Simulation results ( variance matrix):
========================================================
Simulated standard errors
                  Direct     Indirect       Total
dhl          0.024577792 0.0129272070 0.028335501
schools      0.015146244 0.0039862622 0.016354111
log(tot + 1) 0.040420150 0.0102503478 0.044770293
ruch         0.088578780 0.0225102250 0.099527465
parks        0.012891407 0.0034663042 0.014522086
ups          0.113698964 0.0211674881 0.123919020
crossings    0.002284907 0.0002921432 0.002466855
fem_ratio    0.002397949 0.0003680842 0.002656685
fedex        0.301055300 0.0459967224 0.336405595

Simulated z-values:
                Direct  Indirect     Total
dhl           9.966821  1.774916  9.454823
schools       4.912906  1.728297  4.971320
log(tot + 1)  4.403271  1.630457  4.348724
ruch          4.345967  1.613528  4.232815
parks        -4.649420 -1.643065 -4.519525
ups           3.300633  1.657163  3.311489
crossings     1.757779  1.229201  1.773702
fem_ratio    -1.918089 -1.203378 -1.898014
fedex         1.596473  1.029195  1.569434

Simulated p-values:
             Direct     Indirect Total     
dhl          < 2.22e-16 0.075912 < 2.22e-16
schools      8.9736e-07 0.083935 6.6498e-07
log(tot + 1) 1.0663e-05 0.103005 1.3693e-05
ruch         1.3866e-05 0.106630 2.3078e-05
parks        3.3287e-06 0.100369 6.1978e-06
ups          0.00096467 0.097487 0.00092801
crossings    0.07878510 0.218996 0.07611251
fem_ratio    0.05509969 0.228830 0.05769430
fedex        0.11038315 0.303388 0.11654685

Models for Cracow

For the Cracow we applied the same procedure as for Warsaw, so we will omit here additional comments and focus only on the results.

Data loading, spatial transformations and additional feature engineering

In [36]:
df <- readOGR("../datasets/preprocessed_data/df_krakow.shp")
df_data <- read.csv("../datasets/preprocessed_data/df_krakow.csv")
df <- spTransform(df, CRS("+proj=longlat +ellps=GRS80 +no_defs"))
df.cont.nb <- poly2nb(as(df, "SpatialPolygons"), queen=TRUE) #QUEEN but ROOK was tried!
df.cont.listw <- nb2listw(df.cont.nb, style="W")
OGR data source with driver: ESRI Shapefile 
Source: "D:\spatial_econometric_project\datasets\preprocessed_data\df_krakow.shp", layer: "df_krakow"
with 396 features
It has 31 fields
Integer64 fields read as strings:  grid_index 
In [37]:
cat(colnames(df_data))
grid_index geometry dhl dpd fedex inpost poczta ruch ups tot tot_0_14 tot_15_64 tot_65__ tot_male tot_fem male_0_14 male_15_64 male_65__ fem_0_14 fem_15_64 fem_65__ fem_ratio buildings shops parks forests schools railways cycleways parkings crossings bus_stops
In [38]:
stats::cor(df_data %>% select('dhl', 'dpd', 'fedex', 'inpost', 'poczta', 'ruch', 'ups'))
dhldpdfedexinpostpocztaruchups
dhl1.0000000 0.6382852 0.365399440.6853909 0.9543118 0.7208735 0.59141000
dpd0.6382852 1.0000000 0.152202650.5960094 0.6589668 0.5101946 0.48551713
fedex0.3653994 0.1522026 1.000000000.2529419 0.2965195 0.2026232 0.06370109
inpost0.6853909 0.5960094 0.252941891.0000000 0.6777870 0.5782590 0.51096506
poczta0.9543118 0.6589668 0.296519490.6777870 1.0000000 0.7762247 0.63031294
ruch0.7208735 0.5101946 0.202623170.5782590 0.7762247 1.0000000 0.59953072
ups0.5914100 0.4855171 0.063701090.5109651 0.6303129 0.5995307 1.00000000
In [39]:
df$tot_0_14 <- df$tot_0_14/df$tot
df$tot_15_64 <- df$tot_15_64/df$tot
df$tot_65__ <- df$tot_65__/df$tot

df$male_0_14 <- df$male_0_14/df$tot_male 
df$male_15_64 <- df$male_15_64/df$tot_male 
df$male_65__ <- df$male_65__/df$tot_male 

df$fem_0_14 <- df$fem_0_14/df$tot_fem 
df$fem_15_64 <- df$fem_15_64/df$tot_fem 
df$fem_65__ <- df$fem_65__/df$tot_fem 

OLS models

In [40]:
formula = inpost ~ dhl + dpd + fedex + ruch + ups +
            log(tot + 1) + fem_ratio + 
            buildings + shops + parks + forests + schools + 
            railways + cycleways + parkings + crossings + bus_stops
In [41]:
ols <- lm(formula, data=df)
summary(ols)
car::vif(ols)
bptest(ols) 
resettest(ols, power=2, type="regressor")
lm.morantest(ols, df.cont.listw) 
Call:
lm(formula = formula, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5947 -0.3813 -0.1155  0.2777  4.5829 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -2.488e-01  2.005e-01  -1.241 0.215269    
dhl           1.396e-01  5.011e-02   2.786 0.005611 ** 
dpd           2.882e-01  1.181e-01   2.441 0.015086 *  
fedex         3.097e-01  2.544e-01   1.217 0.224222    
ruch          1.630e-01  1.278e-01   1.275 0.202973    
ups           3.471e-01  1.401e-01   2.478 0.013649 *  
log(tot + 1)  1.759e-01  3.619e-02   4.860 1.72e-06 ***
fem_ratio    -4.458e-03  2.012e-03  -2.216 0.027299 *  
buildings     1.653e-04  1.429e-03   0.116 0.907966    
shops        -2.642e-02  4.412e-02  -0.599 0.549580    
parks        -1.900e-01  4.962e-02  -3.828 0.000151 ***
forests      -1.475e-03  4.636e-03  -0.318 0.750610    
schools       1.162e-01  2.165e-02   5.369 1.38e-07 ***
railways      3.185e-05  2.674e-03   0.012 0.990503    
cycleways     1.733e-02  4.848e-03   3.576 0.000395 ***
parkings      1.348e-02  3.347e-03   4.029 6.77e-05 ***
crossings    -1.951e-04  4.568e-03  -0.043 0.965949    
bus_stops     2.966e-02  2.497e-02   1.188 0.235638    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9239 on 378 degrees of freedom
Multiple R-squared:  0.7301,	Adjusted R-squared:  0.718 
F-statistic: 60.16 on 17 and 378 DF,  p-value: < 2.2e-16
dhl
5.91530699443968
dpd
2.09119602039824
fedex
1.25663490762429
ruch
2.59382177999011
ups
1.84912107764761
log(tot + 1)
2.9741467267538
fem_ratio
1.57709281465724
buildings
1.29008503139089
shops
1.61869388677658
parks
2.61338556051656
forests
1.13210023260207
schools
3.06269293498438
railways
1.32307665061535
cycleways
2.72966242316714
parkings
2.21290088213628
crossings
7.1572960902755
bus_stops
2.47137276155454
	studentized Breusch-Pagan test

data:  ols
BP = 111.49, df = 17, p-value = 6.313e-16
	RESET test

data:  ols
RESET = 7.3525, df1 = 17, df2 = 361, p-value = 1.501e-15
	Global Moran I for regression residuals

data:  
model: lm(formula = formula, data = df)
weights: df.cont.listw

Moran I statistic standard deviate = 0.86543, p-value = 0.1934
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
    0.0119504900    -0.0105986646     0.0006788895 
In [45]:
intercept_only <- lm(inpost~1, data=df)
all <- lm(formula, data=df)
both <- step(intercept_only, direction='both', scope=formula(all), trace=0)

Based on Stepwise Regression we obtained final formula of the model and we are able to run our final OLS with diagnostic part. We see that nearly all variables in the model are significant (of course due to the heteroscedasticity we use biased error variance estimator - we omitted application of the robust matrix due to lack of time). There is no collinearity in the model, but again we did not pass Ramsey and B-P tests. What is more residuals from the model are not spatially dependent. In case of interpretation we can see that: logistic companies have positive sign as expected; the greater the total density the more InPost pick-up points are stated; fem_ratio results is interesting because its sign is negative (more woman less pick-up points), in addition, the negative relationship between parks and pickup points is interesting but intuitive. There is positive relationship between pickup points and schools, parkings, cycleways and bus stops. It is also intuitive!

In [46]:
formula = formula(both)
ols <- lm(formula, data=df)
summary(ols)
cat("AIC", AIC(ols))
car::vif(ols)
bptest(ols) 
resettest(ols, power=2, type="regressor")
lm.morantest(ols, df.cont.listw) 
Call:
lm(formula = formula, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8521 -0.3876 -0.1289  0.2793  4.4299 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.257783   0.182849  -1.410  0.15940    
schools       0.122166   0.020100   6.078 2.93e-09 ***
dhl           0.171940   0.036511   4.709 3.48e-06 ***
parkings      0.012623   0.003094   4.080 5.48e-05 ***
cycleways     0.017109   0.004353   3.931  0.00010 ***
log(tot + 1)  0.173857   0.035023   4.964 1.04e-06 ***
parks        -0.193973   0.047366  -4.095 5.14e-05 ***
ups           0.361201   0.132367   2.729  0.00665 ** 
dpd           0.261543   0.115585   2.263  0.02421 *  
fem_ratio    -0.004469   0.001989  -2.247  0.02523 *  
bus_stops     0.036833   0.023741   1.551  0.12161    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9197 on 385 degrees of freedom
Multiple R-squared:  0.7276,	Adjusted R-squared:  0.7206 
F-statistic: 102.9 on 10 and 385 DF,  p-value: < 2.2e-16
AIC 1070.335
schools
2.66526956909113
dhl
3.16904435818374
parkings
1.9086031413205
cycleways
2.22102488709486
log(tot + 1)
2.81120907739961
parks
2.40316337562723
ups
1.66611441778565
dpd
2.02317947356491
fem_ratio
1.55568601607867
bus_stops
2.25474958116342
	studentized Breusch-Pagan test

data:  ols
BP = 115.78, df = 10, p-value < 2.2e-16
	RESET test

data:  ols
RESET = 11.021, df1 = 10, df2 = 375, p-value < 2.2e-16
	Global Moran I for regression residuals

data:  
model: lm(formula = formula, data = df)
weights: df.cont.listw

Moran I statistic standard deviate = 0.71454, p-value = 0.2374
alternative hypothesis: greater
sample estimates:
Observed Moran I      Expectation         Variance 
    0.0113325967    -0.0074311675     0.0006895866 

Spatial models with 3, 2, or 1 spatial components

In [47]:
GNS <- sacsarlm(formula, data=df, listw=df.cont.listw, type="sacmixed", method="LU")
summary(GNS)
Call:sacsarlm(formula = formula, data = df, listw = df.cont.listw, 
    type = "sacmixed", method = "LU")

Residuals:
      Min        1Q    Median        3Q       Max 
-3.754389 -0.361068 -0.076336  0.271624  3.595976 

Type: sacmixed 
Coefficients: (numerical Hessian approximate standard errors) 
                   Estimate Std. Error z value  Pr(>|z|)
(Intercept)      -0.1892880  0.4461054 -0.4243 0.6713380
schools           0.1208664  0.0227048  5.3234 1.019e-07
dhl               0.1962639  0.0445578  4.4047 1.059e-05
parkings          0.0059313  0.0036499  1.6251 0.1041487
cycleways         0.0164590  0.0049380  3.3331 0.0008587
log(tot + 1)      0.1349492  0.0378908  3.5615 0.0003687
parks            -0.1756511  0.0494775 -3.5501 0.0003851
ups               0.4202907  0.1360512  3.0892 0.0020069
dpd               0.2239553  0.1174352  1.9071 0.0565136
fem_ratio        -0.0037182  0.0019435 -1.9132 0.0557240
bus_stops         0.0285210  0.0258514  1.1033 0.2699121
lag.schools       0.0062202  0.0832470  0.0747 0.9404381
lag.dhl          -0.1137251  0.1105301 -1.0289 0.3035237
lag.parkings      0.0203799  0.0079271  2.5709 0.0101433
lag.cycleways    -0.0017016  0.0119220 -0.1427 0.8865048
lag.log(tot + 1)  0.0746430  0.0880310  0.8479 0.3964844
lag.parks        -0.1449582  0.1437940 -1.0081 0.3134082
lag.ups           0.6999156  0.5261605  1.3302 0.1834419
lag.dpd           0.1489114  0.4033567  0.3692 0.7119932
lag.fem_ratio    -0.0039038  0.0058527 -0.6670 0.5047704
lag.bus_stops     0.0168665  0.0687687  0.2453 0.8062525

Rho: -0.052812
Approximate (numerical Hessian) standard error: 0.19295
    z-value: -0.27372, p-value: 0.7843
Lambda: 0.098125
Approximate (numerical Hessian) standard error: 0.18232
    z-value: 0.53819, p-value: 0.59044

LR test value: 18.533, p-value: 0.10044

Log likelihood: -513.9013 for sacmixed model
ML residual variance (sigma squared): 0.78334, (sigma: 0.88506)
Number of observations: 396 
Number of parameters estimated: 24 
AIC: 1075.8, (AIC for lm: 1070.3)
In [49]:
SAC <- sacsarlm(formula, data=df, listw=df.cont.listw)
summary(SAC)
Call:sacsarlm(formula = formula, data = df, listw = df.cont.listw)

Residuals:
      Min        1Q    Median        3Q       Max 
-3.742778 -0.378442 -0.082933  0.251057  4.336691 

Type: sac 
Coefficients: (asymptotic standard errors) 
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)  -0.2285383  0.1732757 -1.3189  0.187193
schools       0.1173015  0.0192607  6.0902 1.128e-09
dhl           0.1519248  0.0356379  4.2630 2.017e-05
parkings      0.0117739  0.0029482  3.9935 6.510e-05
cycleways     0.0135092  0.0044227  3.0545  0.002254
log(tot + 1)  0.1480035  0.0352868  4.1943 2.737e-05
parks        -0.2093795  0.0460939 -4.5425 5.560e-06
ups           0.3964271  0.1306332  3.0347  0.002408
dpd           0.2521028  0.1130940  2.2291  0.025804
fem_ratio    -0.0042107  0.0019415 -2.1688  0.030098
bus_stops     0.0287761  0.0228751  1.2580  0.208405

Rho: 0.17813
Asymptotic standard error: 0.073257
    z-value: 2.4315, p-value: 0.015035
Lambda: -0.13362
Asymptotic standard error: 0.12831
    z-value: -1.0414, p-value: 0.29771

LR test value: 6.4252, p-value: 0.040251

Log likelihood: -519.9551 for sac model
ML residual variance (sigma squared): 0.80336, (sigma: 0.8963)
Number of observations: 396 
Number of parameters estimated: 14 
AIC: 1067.9, (AIC for lm: 1070.3)
In [50]:
SDM <- errorsarlm(formula, data=df, listw=df.cont.listw, etype="mixed")
summary(SDM)
Call:errorsarlm(formula = formula, data = df, listw = df.cont.listw, 
    etype = "mixed")

Residuals:
      Min        1Q    Median        3Q       Max 
-3.760312 -0.365729 -0.075627  0.271674  3.603924 

Type: error 
Coefficients: (asymptotic standard errors) 
                    Estimate  Std. Error z value  Pr(>|z|)
(Intercept)      -0.17645482  0.42142109 -0.4187 0.6754253
schools           0.12092515  0.02199218  5.4986 3.829e-08
dhl               0.19673895  0.04426460  4.4446 8.805e-06
parkings          0.00583431  0.00360056  1.6204 0.1051483
cycleways         0.01647383  0.00484529  3.4000 0.0006739
log(tot + 1)      0.13436113  0.03792020  3.5433 0.0003952
parks            -0.17550184  0.04948822 -3.5463 0.0003906
ups               0.41637263  0.13400568  3.1071 0.0018892
dpd               0.22459114  0.11692782  1.9208 0.0547610
fem_ratio        -0.00368692  0.00193419 -1.9062 0.0566261
bus_stops         0.02828919  0.02562449  1.1040 0.2695973
lag.schools       0.00050302  0.05562900  0.0090 0.9927854
lag.dhl          -0.12002572  0.09986252 -1.2019 0.2293985
lag.parkings      0.01941050  0.00698256  2.7799 0.0054383
lag.cycleways    -0.00255865  0.01066416 -0.2399 0.8103848
lag.log(tot + 1)  0.06306240  0.07498315  0.8410 0.4003362
lag.parks        -0.13147963  0.12638399 -1.0403 0.2981919
lag.ups           0.66553839  0.50395021  1.3206 0.1866204
lag.dpd           0.13631876  0.39046925  0.3491 0.7270028
lag.fem_ratio    -0.00352377  0.00545605 -0.6458 0.5183792
lag.bus_stops     0.01432684  0.05960677  0.2404 0.8100544

Lambda: 0.054941, LR test value: 0.2927, p-value: 0.58849
Asymptotic standard error: 0.09206
    z-value: 0.5968, p-value: 0.55064
Wald statistic: 0.35617, p-value: 0.55064

Log likelihood: -513.9413 for error model
ML residual variance (sigma squared): 0.78455, (sigma: 0.88575)
Number of observations: 396 
Number of parameters estimated: 23 
AIC: 1073.9, (AIC for lm: 1072.2)
In [51]:
SDEM <- errorsarlm(formula, data=df, listw=df.cont.listw, etype="emixed")
summary(SDEM)
Call:errorsarlm(formula = formula, data = df, listw = df.cont.listw, 
    etype = "emixed")

Residuals:
      Min        1Q    Median        3Q       Max 
-3.760312 -0.365729 -0.075627  0.271674  3.603924 

Type: error 
Coefficients: (asymptotic standard errors) 
                    Estimate  Std. Error z value  Pr(>|z|)
(Intercept)      -0.17645482  0.42142109 -0.4187 0.6754253
schools           0.12092515  0.02199218  5.4986 3.829e-08
dhl               0.19673895  0.04426460  4.4446 8.805e-06
parkings          0.00583431  0.00360056  1.6204 0.1051483
cycleways         0.01647383  0.00484529  3.4000 0.0006739
log(tot + 1)      0.13436113  0.03792020  3.5433 0.0003952
parks            -0.17550184  0.04948822 -3.5463 0.0003906
ups               0.41637263  0.13400568  3.1071 0.0018892
dpd               0.22459114  0.11692782  1.9208 0.0547610
fem_ratio        -0.00368692  0.00193419 -1.9062 0.0566261
bus_stops         0.02828919  0.02562449  1.1040 0.2695973
lag.schools       0.00050302  0.05562900  0.0090 0.9927854
lag.dhl          -0.12002572  0.09986252 -1.2019 0.2293985
lag.parkings      0.01941050  0.00698256  2.7799 0.0054383
lag.cycleways    -0.00255865  0.01066416 -0.2399 0.8103848
lag.log(tot + 1)  0.06306240  0.07498315  0.8410 0.4003362
lag.parks        -0.13147963  0.12638399 -1.0403 0.2981919
lag.ups           0.66553839  0.50395021  1.3206 0.1866204
lag.dpd           0.13631876  0.39046925  0.3491 0.7270028
lag.fem_ratio    -0.00352377  0.00545605 -0.6458 0.5183792
lag.bus_stops     0.01432684  0.05960677  0.2404 0.8100544

Lambda: 0.054941, LR test value: 0.2927, p-value: 0.58849
Asymptotic standard error: 0.09206
    z-value: 0.5968, p-value: 0.55064
Wald statistic: 0.35617, p-value: 0.55064

Log likelihood: -513.9413 for error model
ML residual variance (sigma squared): 0.78455, (sigma: 0.88575)
Number of observations: 396 
Number of parameters estimated: 23 
AIC: 1073.9, (AIC for lm: 1072.2)
In [52]:
SEM <- errorsarlm(formula, data=df, listw=df.cont.listw)
summary(SEM)
Call:errorsarlm(formula = formula, data = df, listw = df.cont.listw)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.86442 -0.38388 -0.13033  0.28413  4.43792 

Type: error 
Coefficients: (asymptotic standard errors) 
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)  -0.2513907  0.1825858 -1.3768   0.16856
schools       0.1222839  0.0199770  6.1212 9.285e-10
dhl           0.1744698  0.0362854  4.8083 1.522e-06
parkings      0.0122151  0.0030927  3.9497 7.826e-05
cycleways     0.0171933  0.0043202  3.9797 6.899e-05
log(tot + 1)  0.1715968  0.0348729  4.9206 8.627e-07
parks        -0.1926390  0.0468210 -4.1144 3.882e-05
ups           0.3558354  0.1300344  2.7365   0.00621
dpd           0.2611246  0.1137346  2.2959   0.02168
fem_ratio    -0.0043836  0.0019600 -2.2365   0.02532
bus_stops     0.0363045  0.0236105  1.5376   0.12414

Lambda: 0.051161, LR test value: 0.23035, p-value: 0.63126
Asymptotic standard error: 0.09221
    z-value: 0.55483, p-value: 0.57901
Wald statistic: 0.30783, p-value: 0.57901

Log likelihood: -523.0526 for error model
ML residual variance (sigma squared): 0.82154, (sigma: 0.90639)
Number of observations: 396 
Number of parameters estimated: 13 
AIC: 1072.1, (AIC for lm: 1070.3)
In [53]:
SAR <- lagsarlm(formula, data=df, listw=df.cont.listw)
summary(SAR)
Call:lagsarlm(formula = formula, data = df, listw = df.cont.listw)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.80128 -0.37166 -0.10260  0.24129  4.37771 

Type: lag 
Coefficients: (asymptotic standard errors) 
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)  -0.2341478  0.1790217 -1.3079 0.1908972
schools       0.1189272  0.0197130  6.0329 1.610e-09
dhl           0.1613324  0.0362029  4.4563 8.337e-06
parkings      0.0113351  0.0030435  3.7244 0.0001958
cycleways     0.0145720  0.0044047  3.3083 0.0009386
log(tot + 1)  0.1515994  0.0352620  4.2992 1.714e-05
parks        -0.2043753  0.0465837 -4.3873 1.148e-05
ups           0.3763571  0.1298082  2.8993 0.0037396
dpd           0.2542829  0.1130757  2.2488 0.0245262
fem_ratio    -0.0041514  0.0019474 -2.1318 0.0330254
bus_stops     0.0302281  0.0233637  1.2938 0.1957335

Rho: 0.13838, LR test value: 5.3541, p-value: 0.020673
Asymptotic standard error: 0.057539
    z-value: 2.4049, p-value: 0.016177
Wald statistic: 5.7836, p-value: 0.016177

Log likelihood: -520.4907 for lag model
ML residual variance (sigma squared): 0.80902, (sigma: 0.89946)
Number of observations: 396 
Number of parameters estimated: 13 
AIC: 1067, (AIC for lm: 1070.3)
LM test for residual autocorrelation
test value: 0.99309, p-value: 0.31899
In [54]:
SLX <- lmSLX(formula, data=df, listw=df.cont.listw)
summary(SLX)
Call:
lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
    data = as.data.frame(x), weights = weights)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7586 -0.3556 -0.0735  0.2721  3.6347 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -0.176261   0.418246  -0.421 0.673684    
schools           0.120789   0.022690   5.323 1.76e-07 ***
dhl               0.195098   0.045651   4.274 2.44e-05 ***
parkings          0.005944   0.003725   1.596 0.111351    
cycleways         0.016424   0.004999   3.285 0.001115 ** 
log.tot...1.      0.135231   0.039192   3.450 0.000623 ***
parks            -0.176600   0.050968  -3.465 0.000592 ***
ups               0.416330   0.137141   3.036 0.002567 ** 
dpd               0.225329   0.119836   1.880 0.060841 .  
fem_ratio        -0.003700   0.001987  -1.862 0.063436 .  
bus_stops         0.028342   0.026456   1.071 0.284729    
lag.schools       0.003063   0.056190   0.055 0.956560    
lag.dhl          -0.116153   0.101077  -1.149 0.251223    
lag.parkings      0.019129   0.007078   2.703 0.007190 ** 
lag.cycleways    -0.002752   0.010772  -0.255 0.798504    
lag.log.tot...1.  0.058043   0.075474   0.769 0.442348    
lag.parks        -0.128221   0.127751  -1.004 0.316182    
lag.ups           0.653317   0.510612   1.279 0.201520    
lag.dpd           0.148930   0.394592   0.377 0.706069    
lag.fem_ratio    -0.003319   0.005468  -0.607 0.544229    
lag.bus_stops     0.012499   0.060475   0.207 0.836366    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9107 on 375 degrees of freedom
Multiple R-squared:  0.7398,	Adjusted R-squared:  0.726 
F-statistic: 53.32 on 20 and 375 DF,  p-value: < 2.2e-16

Again SAR model seems to be the best in case of spatial terms significance and number of relevant variables! For Cracow BIC and AIC results also indicates SAR as the best model!

In [55]:
cat("ols",BIC(ols) ,'\n')
cat("GNS",BIC(GNS),'\n')
cat("SAC",BIC(SAC),'\n')
cat("SDM",BIC(SDM),'\n')
cat("SDEM",BIC(SDEM),'\n')
cat("SAR",BIC(SAR),'\n')
cat("SLX",BIC(SLX),'\n')
cat("SEM",BIC(SEM),'\n')
cat('\n')
cat("ols",AIC(ols) ,'\n')
cat("GNS",AIC(GNS),'\n')
cat("SAC",AIC(SAC),'\n')
cat("SDM",AIC(SDM),'\n')
cat("SDEM",AIC(SDEM),'\n')
cat("SAR",AIC(SAR),'\n')
cat("SLX",AIC(SLX),'\n')
cat("SEM",AIC(SEM),'\n')
ols 1118.112 
GNS 1171.356 
SAC 1123.65 
SDM 1165.455 
SDEM 1165.455 
SAR 1118.74 
SLX 1159.766 
SEM 1123.863 

ols 1070.335 
GNS 1075.803 
SAC 1067.91 
SDM 1073.883 
SDEM 1073.883 
SAR 1066.981 
SLX 1072.175 
SEM 1072.105 

Now we looked for better formula for SAR and we found it in the general to specific procedure.

In [67]:
formula_max = inpost ~ dhl + dpd + fedex + ruch + ups +
            log(tot + 1) + fem_ratio + 
            buildings + shops + parks + forests + schools + 
            railways + cycleways + parkings + crossings + bus_stops

formula_gen_to_specifc = inpost ~ dhl + dpd + ruch + ups +
            log(tot + 1) + fem_ratio + 
            parks + schools + 
            cycleways + parkings

SAR_g2s<- lagsarlm(formula_gen_to_specifc, data=df, listw=df.cont.listw)
summary(SAR_g2s) #the same as previously!
Call:
lagsarlm(formula = formula_gen_to_specifc, data = df, listw = df.cont.listw)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.71682 -0.36660 -0.09821  0.25738  4.45771 

Type: lag 
Coefficients: (asymptotic standard errors) 
               Estimate Std. Error z value  Pr(>|z|)
(Intercept)  -0.2358848  0.1789241 -1.3184 0.1873862
dhl           0.1474649  0.0390720  3.7742 0.0001605
dpd           0.2694286  0.1125832  2.3932 0.0167044
ruch          0.1777519  0.1225243  1.4507 0.1468500
ups           0.3238056  0.1339810  2.4168 0.0156575
log(tot + 1)  0.1531382  0.0352776  4.3409 1.419e-05
fem_ratio    -0.0041838  0.0019456 -2.1504 0.0315219
parks        -0.2068055  0.0466168 -4.4363 9.153e-06
schools       0.1132504  0.0203474  5.5658 2.609e-08
cycleways     0.0160661  0.0042884  3.7464 0.0001794
parkings      0.0124737  0.0030467  4.0941 4.238e-05

Rho: 0.14868, LR test value: 6.2791, p-value: 0.012217
Asymptotic standard error: 0.057206
    z-value: 2.599, p-value: 0.0093501
Wald statistic: 6.7547, p-value: 0.0093501

Log likelihood: -520.2749 for lag model
ML residual variance (sigma squared): 0.80779, (sigma: 0.89877)
Number of observations: 396 
Number of parameters estimated: 13 
AIC: 1066.5, (AIC for lm: 1070.8)
LM test for residual autocorrelation
test value: 1.0503, p-value: 0.30544

Direct and indirect impacts

Now we move to the statistical inference part.

Let's start with logistic companies variables. We see that either in case of direct and indirect impacts all parameters for logistic companies (apart from Ruch total, and UPS indirect) are significant (at least around 10%). Internalization effects are greater than spill-over, but still spill-over effects are significant. In case of demographic variables we claim that total population density is important variable for the model in case of direct and indirect impact (has positive sign). Feminization ratio in case of direct impact is also significant and has negative sign (it is quite non intuitive). In case of point of interest data we see that number of schools, number of parks, number of cycleways and number of parkings are significant in the model with respectively positive, negative positive, and positive sign. We claim that it is intuitive.

In [71]:
# distribution of total impact 
# vector of traces of powers of a spatial weights matrix
# converting censored listw to CsparseMatrix form
W.c <- as(as_dgRMatrix_listw(df.cont.listw), "CsparseMatrix") 
# the default values for the number of powers is 30
trMat<-trW(W.c, type="mult") 

SAR_imp<-impacts(SAR_g2s, tr=trMat, R=200)
summary(SAR_imp, zstats=TRUE, short=TRUE)
Impact measures (lag, trace):
                   Direct      Indirect        Total
dhl           0.147953001  0.0252655979  0.173218599
dpd           0.270320307  0.0461619849  0.316482292
ruch          0.178340242  0.0304547581  0.208795000
ups           0.324877326  0.0554785630  0.380355889
log(tot + 1)  0.153644998  0.0262376074  0.179882606
fem_ratio    -0.004197681 -0.0007168284 -0.004914509
parks        -0.207489962 -0.0354325896 -0.242922552
schools       0.113625255  0.0194035267  0.133028781
cycleways     0.016119260  0.0027526493  0.018871909
parkings      0.012514984  0.0021371554  0.014652139
========================================================
Simulation results ( variance matrix):
========================================================
Simulated standard errors
                  Direct     Indirect       Total
dhl          0.041731285 0.0130662719 0.048890019
dpd          0.110718063 0.0287868162 0.130905207
ruch         0.128213530 0.0284477403 0.151676217
ups          0.139900865 0.0365312845 0.167022198
log(tot + 1) 0.033852240 0.0123982027 0.039850455
fem_ratio    0.001985456 0.0005281447 0.002395458
parks        0.042365144 0.0183702719 0.053278416
schools      0.019925685 0.0094147423 0.025152062
cycleways    0.004677447 0.0014080831 0.005410368
parkings     0.003082116 0.0010750601 0.003681574

Simulated z-values:
                Direct  Indirect     Total
dhl           3.540222  2.047906  3.569165
dpd           2.439148  1.707472  2.438485
ruch          1.457977  1.222888  1.461804
ups           2.247846  1.591503  2.230932
log(tot + 1)  4.540161  2.249261  4.556571
fem_ratio    -2.248551 -1.589268 -2.214091
parks        -4.844420 -2.066040 -4.564480
schools       5.595339  2.186592  5.251147
cycleways     3.539126  2.109490  3.608703
parkings      3.947842  2.057703  3.905900

Simulated p-values:
             Direct     Indirect Total     
dhl          0.00039979 0.040569 0.00035812
dpd          0.01472195 0.087734 0.01474897
ruch         0.14484675 0.221372 0.14379506
ups          0.02458603 0.111497 0.02568562
log(tot + 1) 5.6211e-06 0.024496 5.1996e-06
fem_ratio    0.02454109 0.112000 0.02682251
parks        1.2698e-06 0.038825 5.0073e-06
schools      2.2019e-08 0.028772 1.5116e-07
cycleways    0.00040145 0.034902 0.00030773
parkings     7.8859e-05 0.039619 9.3875e-05

Conclusions from spatial dependence models

We combine our statistical inference for Warsaw and Cracow. Base of above research we can generally claim that InPost has deployed parcel pick-up points in line with the competitors - as the number of InPost pick-up points increases, the number of competitors' pick-up points increases (ignoring competitors not significant in the regression). We demonstrate that our control variables from the demographic group and points of interest are also significant, so InPost follows modeling logic and literature guidance for instance such variables are positively related with number of InPost's pick-up points: number of cycleways, number of parkings, number of schools, number of crossings. As it turned out a significant spatial effect in multivariate econometric models is Spatial Lag. It is quite intuitive that spillover effect exists for this problem.

8. Spatial drift models

Import dependencies

In [245]:
library(spdep)
library(rgdal)
library(maptools)
library(sp)
library(RColorBrewer)
library(classInt)
library(GISTools)
library(maps)
library(spgwr)
library(factoextra)
library(NbClust)
library(repr)
library(stargazer)
options(warn = -1)

In this part of the study we apply Geographically Weighted Regression (GWR) method based on Mueller et al. (2013). This research uses clusterization methods.

Contrary to the previous models, we will not use data for 1 km2 grids, but only for inpost points. Thus, our points on the map will be all inpost machines for Warszawa and Kraków. For the variables for given points, we decided to create 500 meters buffers around a given point, for which: we count the operator's points and points of interest within this range, and we also collected data for grids (if more than one grid coincided with the buffer, then we took the average for grid data). As the target variable, we chose the number of inpost parcel machines around each Inpost point.

Analysis for Warszawa

In [210]:
df_points_Warszawa<-readOGR(".", "df_warszawa")
OGR data source with driver: ESRI Shapefile 
Source: "C:\Users\Michal_schudnij\Desktop\Spatial_eco\repo\codes", layer: "df_warszawa"
with 938 features
It has 49 fields
Integer64 fields read as strings:  inpost_poi 
In [211]:
crds<-coordinates(df_points_Warszawa)
In [212]:
names(df_points_Warszawa)[3] <- 'inpost'
names(df_points_Warszawa)[4] <- 'poczta'
names(df_points_Warszawa)[5] <- 'dhl'
names(df_points_Warszawa)[6] <- 'ruch'
names(df_points_Warszawa)[7] <- 'dpd'
names(df_points_Warszawa)[8] <- 'ups'
names(df_points_Warszawa)[22] <- 'fem_ratio'
names(df_points_Warszawa)[36] <- 'tot_log'
names(df_points_Warszawa)[44] <- 'male_65_log'
In [213]:
df_Warszawa <- as.data.frame(df_points_Warszawa)
df_Warszawa$inpost <- as.numeric(df_Warszawa$inpost)

Basic plot for all Inpost points in Warszawa

In [214]:
options(repr.plot.width=4, repr.plot.height=4)
ggplot() + geom_point(data=df_Warszawa, aes(long, lat), size = 2.5, alpha = 0.7)

Model equation for GWR

We do not include variables with too high correlation. According to non spatial explanatory analysis we remove most of the demographic variables.

In [215]:
# equation
eq <- inpost ~ poczta + dhl + dpd + ruch + ups + 
            buildings + parks + forests + schools + railways + cycleways + parkings + crossings + 
            tot_log + male_65_log + fem_ratio

Selection of optimum bandwith

Commented out due to the time of execution.

In [155]:
# optimum bandwidth
# bw <- ggwr.sel(eq, data=df_Warszawa, coords=crds, family=poisson(), longlat=TRUE)
Bandwidth: 14.89077 CV score: 10297.78 
Bandwidth: 24.06972 CV score: 10387.11 
Bandwidth: 9.217872 CV score: 10110.47 
Bandwidth: 5.711825 CV score: 9772.493 
Bandwidth: 3.544969 CV score: 9274.785 
Bandwidth: 2.205778 CV score: 8763.745 
Bandwidth: 1.378112 CV score: 18911.88 
Bandwidth: 2.717303 CV score: 8921.138 
Bandwidth: 1.889638 CV score: 8939.769 
Bandwidth: 2.314246 CV score: 8775.789 
Bandwidth: 2.146219 CV score: 8767.203 
Bandwidth: 2.204847 CV score: 8763.735 
Bandwidth: 2.201018 CV score: 8763.717 
Bandwidth: 2.20067 CV score: 8763.717 
Bandwidth: 2.200711 CV score: 8763.717 
Bandwidth: 2.200629 CV score: 8763.717 
Bandwidth: 2.20067 CV score: 8763.717 
In [218]:
bw_Warszawa = 2.20067014501355

GWR model

In [219]:
# GWR model
model.ggwr_Warszawa<-ggwr(eq, data=df_Warszawa, coords=crds, family=poisson(), longlat=TRUE, bandwidth=bw_Warszawa)
model.ggwr_Warszawa
Call:
ggwr(formula = eq, data = df_Warszawa, coords = crds, bandwidth = bw_Warszawa, 
    family = poisson(), longlat = TRUE)
Kernel function: gwr.Gauss 
Fixed bandwidth: 2.20067 
Summary of GWR coefficient estimates at data points:
                    Min.     1st Qu.      Median     3rd Qu.        Max.
X.Intercept. -5.40384081  0.58759871  1.03740602  1.95221534  5.67025383
poczta       -0.15594598 -0.02042205 -0.01377449  0.00056021  0.32837003
dhl          -0.30763314 -0.00277764  0.01993089  0.02888271  0.27561320
dpd          -0.18407471 -0.00988317  0.00461828  0.03625983  0.22978494
ruch         -0.21374910 -0.01714193  0.03190855  0.05855257  0.13106994
ups          -0.21635700 -0.01479274  0.01435957  0.08409953  0.65641568
buildings    -0.00765355 -0.00032297  0.00022073  0.00078364  0.02662756
parks        -0.66805712 -0.03123343 -0.01582486 -0.00159138  0.18623359
forests      -0.10656564 -0.00409090 -0.00073454  0.00372430  0.02922877
schools      -0.19026278 -0.00053690  0.00408094  0.01003514  0.10767704
railways     -0.11421782 -0.00263582 -0.00031269  0.00135871  0.02493791
cycleways    -0.04109024 -0.00436540 -0.00341032 -0.00098341  0.05082756
parkings     -0.06290399 -0.00166342 -0.00017565  0.00164858  0.02189712
crossings    -0.00677032 -0.00041061  0.00066429  0.00285464  0.03337199
tot_log      -1.59106171 -0.14781534  0.08511758  0.20250356  0.94311445
male_65_log  -0.82628402 -0.04730738  0.08900481  0.21539295  1.20918518
fem_ratio    -0.01915643 -0.00183401  0.00110957  0.00519756  0.05265112
              Global
X.Intercept.  0.5150
poczta       -0.0123
dhl           0.0135
dpd           0.0012
ruch          0.0321
ups           0.0238
buildings     0.0004
parks        -0.0084
forests       0.0004
schools       0.0028
railways     -0.0007
cycleways    -0.0033
parkings     -0.0004
crossings     0.0005
tot_log       0.1907
male_65_log  -0.0346
fem_ratio     0.0019
In [220]:
features <- c('poczta', 'dhl', 'dpd', 'ruch', 'ups', 'buildings', 'parks', 
    'forests', 'schools', 'railways', 'cycleways', 'parkings', 'crossings', 
    'tot_log', 'male_65_log', 'fem_ratio')

Visualisation of GWR output for Warszawa

In [226]:
options(repr.plot.width=5, repr.plot.height=3)
layout(matrix(1:16, 1, 16))

for (i in features){
    g <- ggplot() + 
    geom_point(data = df_Warszawa, aes(long, lat, colour = model.ggwr_Warszawa$SDF[[i]]), size = 3, alpha = 0.7) +
    labs(title = paste("GWR for:", i, "variable")) +
    guides(colour =  guide_legend(title = "GWR coeff"))
    print(g)
}

Taking into account the above visualizations, it should be noted that spatial differences between different areas in Warsaw are imperceptible for most of the variables. In the case of the coefficients for the operators: for Poczta Polska, DHL, DPD and UPS, the coefficients are rather negative or zero, while for Ruch they are rather positive. Regarding points of interest, most of the coefficients hover around zero. In the case of demographic variables, there is a variable total population, for which the ratios are mostly positive, while for the population of men over 65, this variable has a negative ratio.

Optimal number of clusters for Warszawa

In [165]:
# clustering of GWR coefficients
fviz_nbclust(as.data.frame(model.ggwr_Warszawa$SDF[,2:18]), FUNcluster=kmeans)

Despite the fact that the figure above shows the optimal number of 2 clusters, we decide to choose 5 clusters, due to the slightly lower average silhouette and more interesting analysis than for 2 clusters only.

In [227]:
clusters1<-kmeans(as.data.frame(model.ggwr_Warszawa$SDF[,2:18]), 5) 

options(repr.plot.width=5, repr.plot.height=3)

ggplot(data=df_Warszawa, aes(long, lat)) + geom_point(aes(colour = as.factor(clusters1$cluster)), size = 2.5, alpha = 0.7) +
labs(title = '5 clusters, results from kmeans()')

In the figure above, we can see a clear division into 2 central clusters (second and fourth) and 3 smaller clusters on the outskirts of Warsaw (first, third and fifth cluster).

In [228]:
df_Warszawa$clust1<-rep(0, times=dim(df_Warszawa)[1])
df_Warszawa$clust1[clusters1$cluster==1]<-1
df_Warszawa$clust2<-rep(0, times=dim(df_Warszawa)[1])
df_Warszawa$clust2[clusters1$cluster==2]<-1
df_Warszawa$clust3<-rep(0, times=dim(df_Warszawa)[1])
df_Warszawa$clust3[clusters1$cluster==3]<-1
df_Warszawa$clust4<-rep(0, times=dim(df_Warszawa)[1])
df_Warszawa$clust4[clusters1$cluster==4]<-1
df_Warszawa$clust5<-rep(0, times=dim(df_Warszawa)[1])
df_Warszawa$clust5[clusters1$cluster==5]<-1

Equation for a-spatial linear model

In [174]:
eq1 <- inpost ~ poczta + dhl + dpd + ruch + ups + 
            buildings + parks + forests + schools + railways + cycleways + parkings + crossings + 
            tot_log + male_65_log + fem_ratio +  clust1 + clust2 + clust3 + clust4

a-spatial linear model with full specification

In [175]:
model.ols_Warszawa<-lm(eq1, data=df_Warszawa)
summary(model.ols_Warszawa)
Call:
lm(formula = eq1, data = df_Warszawa)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8505 -0.8267  0.6386  1.9559  6.9644 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.025931   2.162413  -1.862 0.062954 .  
poczta      -0.099337   0.075912  -1.309 0.191004    
dhl          0.114894   0.066974   1.715 0.086592 .  
dpd          0.021883   0.104635   0.209 0.834387    
ruch         0.334948   0.114142   2.934 0.003424 ** 
ups          0.243012   0.149803   1.622 0.105101    
buildings    0.003823   0.002598   1.472 0.141476    
parks       -0.073882   0.027474  -2.689 0.007293 ** 
forests      0.007963   0.016308   0.488 0.625463    
schools      0.031139   0.026383   1.180 0.238193    
railways    -0.005929   0.006201  -0.956 0.339274    
cycleways   -0.031289   0.009675  -3.234 0.001265 ** 
parkings    -0.002102   0.005276  -0.398 0.690397    
crossings    0.004946   0.004705   1.051 0.293453    
tot_log      1.429668   0.388484   3.680 0.000247 ***
male_65_log -0.263898   0.309958  -0.851 0.394769    
fem_ratio    0.017118   0.013088   1.308 0.191241    
clust1      -0.431690   0.562488  -0.767 0.443002    
clust2      -0.234934   0.457335  -0.514 0.607584    
clust3      -0.166687   0.666269  -0.250 0.802505    
clust4       0.046876   0.430626   0.109 0.913340    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.288 on 917 degrees of freedom
Multiple R-squared:  0.1918,	Adjusted R-squared:  0.1742 
F-statistic: 10.88 on 20 and 917 DF,  p-value: < 2.2e-16

2nd equation for a-spatial linear model without insignificant variables (except clusters dummies)

Due to the large number of insignificant variables, we eliminated these variables (except clusters dummies).

In [229]:
eq2 <- inpost ~ ruch + ups + 
            parks + cycleways + 
            tot_log +  clust1 + clust2 + clust3 + clust4

Final a-spatial linear model

In [230]:
model.ols_Warszawa2<-lm(eq2, data=df_Warszawa)
summary(model.ols_Warszawa2)
Call:
lm(formula = eq2, data = df_Warszawa)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.9776 -0.7694  0.6371  1.9826  6.9376 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.65204    1.15406  -2.298 0.021783 *  
ruch         0.36906    0.10048   3.673 0.000253 ***
ups          0.27289    0.14284   1.911 0.056373 .  
parks       -0.05472    0.02409  -2.272 0.023323 *  
cycleways   -0.02125    0.00789  -2.693 0.007208 ** 
tot_log      1.34984    0.14648   9.215  < 2e-16 ***
clust1      -0.14462    0.68946  -0.210 0.833906    
clust2       0.21906    0.57300   0.382 0.702326    
clust3       0.23150    0.62060   0.373 0.709215    
clust4      -0.10192    0.60157  -0.169 0.865504    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.287 on 928 degrees of freedom
Multiple R-squared:  0.1826,	Adjusted R-squared:  0.1746 
F-statistic: 23.03 on 9 and 928 DF,  p-value: < 2.2e-16

Adjusted R2 for "Warszawa model" is equal to 17.5%. It is relatively low value. The reason for that could be probably irregular location of Inpost parcels machines in Warszawa. Regarding this model it is important that all variables are jointly significant. Analyzing the coefficients, it should be noted, that all coefficients for clusters dummies are insignificant - in this case division to the clusters is not needed (homogeneity in Warsaw). In the case of positive significant coefficients: the more Ruch and UPS points and the higher the total population, the more Inpost points. Considering negative significant variables: more parks and cycleways cause the lower number of Inpost points.

Analysis for Kraków

In [231]:
df_points_Krakow<-readOGR(".", "df_krakow")
OGR data source with driver: ESRI Shapefile 
Source: "C:\Users\Michal_schudnij\Desktop\Spatial_eco\repo\codes", layer: "df_krakow"
with 388 features
It has 49 fields
Integer64 fields read as strings:  inpost_poi 
In [232]:
crds<-coordinates(df_points_Krakow)
In [233]:
names(df_points_Krakow)[3] <- 'inpost'
names(df_points_Krakow)[4] <- 'poczta'
names(df_points_Krakow)[5] <- 'dhl'
names(df_points_Krakow)[6] <- 'ruch'
names(df_points_Krakow)[7] <- 'dpd'
names(df_points_Krakow)[8] <- 'ups'
names(df_points_Krakow)[22] <- 'fem_ratio'
names(df_points_Krakow)[36] <- 'tot_log'
names(df_points_Krakow)[44] <- 'male_65_log'
In [234]:
df_Krakow <- as.data.frame(df_points_Krakow)
df_Krakow$inpost <- as.numeric(df_Krakow$inpost)

Basic plot for all Inpost points in Kraków

In [235]:
options(repr.plot.width=4, repr.plot.height=4)
ggplot() + geom_point(data=df_Krakow, aes(long, lat), size = 2.5, alpha = 0.7)

Model equation for GWR

The same set of variables as for Warszawa

In [236]:
# equation
eq <- inpost ~ poczta + dhl + dpd + ruch + ups + 
            buildings + parks + forests + schools + railways + cycleways + parkings + crossings + 
            tot_log + male_65_log + fem_ratio

Selection of optimum bandwith

Commented out due to the time of execution.

In [184]:
# # optimum bandwidth
# bw_Krakow <- ggwr.sel(eq, data=df_Krakow, coords=crds, family=poisson(), longlat=TRUE)
Bandwidth: 10.5784 CV score: 753.5725 
Bandwidth: 17.09913 CV score: 763.6016 
Bandwidth: 6.548374 CV score: 729.3078 
Bandwidth: 4.057679 CV score: 675.5492 
Bandwidth: 2.518345 CV score: 612.6373 
Bandwidth: 1.566984 CV score: 590.8475 
Bandwidth: 0.45501 CV score: 1505370 
Bandwidth: 1.142248 CV score: 603.5507 
Bandwidth: 1.744269 CV score: 594.3376 
Bandwidth: 1.53614 CV score: 590.1974 
Bandwidth: 1.385687 CV score: 588.8676 
Bandwidth: 1.395428 CV score: 588.7994 
Bandwidth: 1.421648 CV score: 588.744 
Bandwidth: 1.465381 CV score: 589.0258 
Bandwidth: 1.41632 CV score: 588.7406 
Bandwidth: 1.416594 CV score: 588.7406 
Bandwidth: 1.416531 CV score: 588.7406 
Bandwidth: 1.41649 CV score: 588.7406 
Bandwidth: 1.416531 CV score: 588.7406 
In [237]:
bw_Krakow = 1.41653116071638

GWR model

In [238]:
model.ggwr_Krakow<-ggwr(eq, data=df_Krakow, coords=crds, family=poisson(), longlat=TRUE, bandwidth=bw_Krakow)
model.ggwr_Krakow
Call:
ggwr(formula = eq, data = df_Krakow, coords = crds, bandwidth = bw_Krakow, 
    family = poisson(), longlat = TRUE)
Kernel function: gwr.Gauss 
Fixed bandwidth: 1.416531 
Summary of GWR coefficient estimates at data points:
                    Min.     1st Qu.      Median     3rd Qu.        Max.
X.Intercept. -1.1424e+01 -5.6233e+00 -3.4161e+00 -1.7161e-01  6.2185e+00
poczta       -4.7856e-01 -7.3706e-02 -2.1077e-02  7.2386e-03  1.2552e-01
dhl          -1.7738e-01 -3.3733e-03  2.9753e-02  5.9415e-02  6.8386e-01
dpd          -7.4700e-01 -1.1104e-01 -5.4823e-02  8.1547e-02  6.0352e-01
ruch         -2.6993e-01 -2.1470e-02  1.3083e-01  1.9562e-01  5.3755e-01
ups          -9.2996e-01  3.2486e-03  6.5286e-02  1.2378e-01  4.8286e-01
buildings    -1.8572e-02 -1.9911e-03 -7.2061e-04  8.4901e-05  2.0226e-02
parks        -2.1300e-01 -1.1900e-01 -8.5388e-02 -5.5236e-02  2.1936e-01
forests      -7.4076e-02 -2.1885e-02 -9.6270e-03 -1.9117e-03  5.3955e-02
schools      -7.9674e-02 -2.3819e-02 -5.5960e-03  8.6061e-03  6.9566e-02
railways     -4.6448e-02 -2.5616e-03  7.2440e-04  4.3198e-03  3.3464e-02
cycleways    -1.0555e-02 -1.0057e-03  1.7514e-03  3.2171e-03  3.0754e-02
parkings     -1.4980e-02  7.2562e-04  2.3847e-03  3.9087e-03  3.4277e-02
crossings    -1.9648e-02 -2.6866e-04  2.0802e-03  4.2313e-03  1.5818e-02
tot_log      -1.0980e+00  2.8034e-01  5.5004e-01  8.4958e-01  2.1858e+00
male_65_log  -2.5975e+00 -5.2830e-01 -2.3640e-01  5.0413e-02  1.3728e+00
fem_ratio    -7.8620e-02 -1.1985e-02  8.3191e-03  2.2638e-02  6.5229e-02
              Global
X.Intercept. -3.2809
poczta       -0.0245
dhl           0.0274
dpd          -0.0291
ruch          0.1062
ups           0.0651
buildings    -0.0002
parks        -0.0820
forests      -0.0087
schools      -0.0111
railways      0.0008
cycleways     0.0029
parkings      0.0017
crossings     0.0013
tot_log       0.7197
male_65_log  -0.3685
fem_ratio     0.0039
In [159]:
features <- c('poczta', 'dhl', 'dpd', 'ruch', 'ups', 'buildings', 'parks', 
    'forests', 'schools', 'railways', 'cycleways', 'parkings', 'crossings', 
    'tot_log', 'male_65_log', 'fem_ratio')

Visualisation of GWR output for Kraków

In [239]:
options(repr.plot.width=5, repr.plot.height=3)
layout(matrix(1:16, 1, 16))

for (i in features){

    
    g <- ggplot() + 
    geom_point(data = df_Krakow, aes(long, lat, colour = model.ggwr_Krakow$SDF[[i]]), size = 3, alpha = 0.7) +
    labs(title = paste("GWR for:", i, "variable")) +
    guides(colour =  guide_legend(title = "GWR coeff"))
    print(g)
}

Considering the above visualizations for Kraków, it should be noted that we have greater spatial variability between different areas than in Warszawa. In the case of coefficients for operators: most of the coefficients are around zero, however there are areas with positive or negative coefficients. Regarding points of interest, variables defining parks and forests are mostly negative, for the rest points of interest coefficients are around zero. In the case of demographic variables, again, as in the case of Warsaw, for total population, the coefficients are mostly positive (however with some 'negative' areas), while for the population of men over 65, this variable has a positive value for relatively many points.

Optimal number of clusters for Kraków

In [240]:
# clustering of GWR coefficients
fviz_nbclust(as.data.frame(model.ggwr_Krakow$SDF[,2:18]), FUNcluster=kmeans)

Despite the fact that the figure above shows the optimal number of 2 clusters, we decide to choose 4 clusters, due to the slightly lower average silhouette and more interesting analysis than for 2 clusters only.

In [241]:
clusters1<-kmeans(as.data.frame(model.ggwr_Krakow$SDF[,2:18]), 4) 

options(repr.plot.width=5, repr.plot.height=3)

ggplot(data=df_Krakow, aes(long, lat)) + geom_point(aes(colour = as.factor(clusters1$cluster)), size = 2.5, alpha = 0.7) +
labs(title = '4 clusters, results from kmeans()')

In the figure above, we can see a division into 2 main vertical clusters (second and third) and 2 smaller clusters (first and fourth cluster) located on the outskirts of Krakow.

In [243]:
df_Krakow$clust1<-rep(0, times=dim(df_Krakow)[1])
df_Krakow$clust1[clusters1$cluster==1]<-1
df_Krakow$clust2<-rep(0, times=dim(df_Krakow)[1])
df_Krakow$clust2[clusters1$cluster==2]<-1
df_Krakow$clust3<-rep(0, times=dim(df_Krakow)[1])
df_Krakow$clust3[clusters1$cluster==3]<-1
df_Krakow$clust4<-rep(0, times=dim(df_Krakow)[1])
df_Krakow$clust4[clusters1$cluster==4]<-1

Equation for a-spatial linear model

In [192]:
eq1 <- inpost ~ poczta + dhl + dpd + ruch + ups + 
            buildings + parks + forests + schools + railways + cycleways + parkings + crossings + 
            tot_log + male_65_log + fem_ratio + clust1 + clust2 + clust3

a-spatial linear model with full specification

In [193]:
model.ols_Krakow<-lm(eq1, data=df_Krakow)
summary(model.ols_Krakow)
Call:
lm(formula = eq1, data = df_Krakow)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7373 -0.9511 -0.1385  0.8100  3.5824 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.3110028  1.7920716  -1.848 0.065465 .  
poczta      -0.0339122  0.0618983  -0.548 0.584114    
dhl          0.0734430  0.0655283   1.121 0.263111    
dpd         -0.0428689  0.1079344  -0.397 0.691468    
ruch         0.3984424  0.0977392   4.077 5.60e-05 ***
ups          0.3714112  0.1069046   3.474 0.000573 ***
buildings   -0.0004254  0.0015899  -0.268 0.789163    
parks       -0.2763362  0.0501453  -5.511 6.73e-08 ***
forests     -0.0243780  0.0137741  -1.770 0.077581 .  
schools     -0.0580050  0.0242127  -2.396 0.017090 *  
railways    -0.0028033  0.0048236  -0.581 0.561477    
cycleways    0.0096260  0.0040053   2.403 0.016743 *  
parkings     0.0085784  0.0033336   2.573 0.010463 *  
crossings    0.0085397  0.0042668   2.001 0.046078 *  
tot_log      1.0011421  0.3350050   2.988 0.002992 ** 
male_65_log -0.2586526  0.2913245  -0.888 0.375201    
fem_ratio   -0.0049580  0.0134370  -0.369 0.712354    
clust1       0.5684167  0.2710271   2.097 0.036652 *  
clust2      -0.4520825  0.2245136  -2.014 0.044778 *  
clust3       0.1041434  0.2003245   0.520 0.603464    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.366 on 368 degrees of freedom
Multiple R-squared:  0.4327,	Adjusted R-squared:  0.4034 
F-statistic: 14.77 on 19 and 368 DF,  p-value: < 2.2e-16

2nd equation for a-spatial linear model without insignificant variables

Due to the quite large number of insignificant variables, we eliminated these variables.

In [194]:
eq2<-inpost ~ ruch + ups + 
            parks + forests + schools + cycleways + parkings + crossings + 
            tot_log + clust1 + clust2 + clust3

Final a-spatial linear model with 2nd specification

In [195]:
# a-spatial linear model
model.ols_Krakow2<-lm(eq2, data=df_Krakow)
summary(model.ols_Krakow2)
Call:
lm(formula = eq2, data = df_Krakow)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7190 -0.8983 -0.1202  0.7876  3.6214 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.818353   0.789262  -3.571 0.000402 ***
ruch         0.390744   0.091395   4.275 2.42e-05 ***
ups          0.395793   0.101731   3.891 0.000118 ***
parks       -0.264951   0.048524  -5.460 8.67e-08 ***
forests     -0.022671   0.013396  -1.692 0.091411 .  
schools     -0.068556   0.022153  -3.095 0.002118 ** 
cycleways    0.008730   0.003825   2.282 0.023044 *  
parkings     0.010665   0.002894   3.685 0.000262 ***
crossings    0.009355   0.003407   2.746 0.006330 ** 
tot_log      0.704860   0.103804   6.790 4.40e-11 ***
clust1       0.676982   0.244864   2.765 0.005978 ** 
clust2      -0.494152   0.205339  -2.407 0.016588 *  
clust3       0.067014   0.183911   0.364 0.715780    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.36 on 375 degrees of freedom
Multiple R-squared:  0.4275,	Adjusted R-squared:  0.4092 
F-statistic: 23.34 on 12 and 375 DF,  p-value: < 2.2e-16

Adjusted R2 for "Kraków model" is equal to 41%. Comparing this result to model for Warszawa, we can say that distribution of Kraków Inpost machines is definitely more explainable than in Warszawa. Of course, all variables are also jointly significant. In contrast to Warszawa, clusters division for Kraków is definitely significant. The number of Inpost points is higher in cluster 1 (observations in the center of Kraków) than in cluster 4 (observations on the outskirts of Kraków), while the number of inpost points is lower in cluster 2 (points located on the north west of Kraków) than in cluster 4 (observations on the outskirts of Kraków). Estimate for cluster 3 (center of Kraków) is insignificant (no difference between third and fourth cluster). We can see definitely heterogeneity of this phenomenon in Kraków. Regarding other variables, absolutely more variables are significant than in the case of Warszawa. The greater number of Ruch and UPS points indicates a greater number of Inpost points. In the case of points of interest, we can see that the greater number of cycleways, parkings and crossings also increases the number of Inpost points, while the number of green areas (parks and forests) and schools reduces the number of Inpost points. Finally, the only demographic variable that is significant is total population. As expected, the more people, the larger the dependent variable.

Comparison of models and conclusions

In [248]:
stargazer(model.ols_Warszawa, model.ols_Warszawa2, model.ols_Krakow, model.ols_Krakow2, type = 'text')
======================================================================================================================
                                                           Dependent variable:                                        
                    --------------------------------------------------------------------------------------------------
                                                                  inpost                                              
                              (1)                      (2)                     (3)                      (4)           
----------------------------------------------------------------------------------------------------------------------
poczta                       -0.099                                           -0.034                                  
                            (0.076)                                          (0.062)                                  
                                                                                                                      
dhl                          0.115*                                           0.073                                   
                            (0.067)                                          (0.066)                                  
                                                                                                                      
dpd                          0.022                                            -0.043                                  
                            (0.105)                                          (0.108)                                  
                                                                                                                      
ruch                        0.335***                0.369***                 0.398***                 0.391***        
                            (0.114)                  (0.100)                 (0.098)                  (0.091)         
                                                                                                                      
ups                          0.243                   0.273*                  0.371***                 0.396***        
                            (0.150)                  (0.143)                 (0.107)                  (0.102)         
                                                                                                                      
buildings                    0.004                                           -0.0004                                  
                            (0.003)                                          (0.002)                                  
                                                                                                                      
parks                      -0.074***                -0.055**                -0.276***                -0.265***        
                            (0.027)                  (0.024)                 (0.050)                  (0.049)         
                                                                                                                      
forests                      0.008                                           -0.024*                  -0.023*         
                            (0.016)                                          (0.014)                  (0.013)         
                                                                                                                      
schools                      0.031                                           -0.058**                -0.069***        
                            (0.026)                                          (0.024)                  (0.022)         
                                                                                                                      
railways                     -0.006                                           -0.003                                  
                            (0.006)                                          (0.005)                                  
                                                                                                                      
cycleways                  -0.031***                -0.021***                0.010**                  0.009**         
                            (0.010)                  (0.008)                 (0.004)                  (0.004)         
                                                                                                                      
parkings                     -0.002                                          0.009**                  0.011***        
                            (0.005)                                          (0.003)                  (0.003)         
                                                                                                                      
crossings                    0.005                                           0.009**                  0.009***        
                            (0.005)                                          (0.004)                  (0.003)         
                                                                                                                      
tot_log                     1.430***                1.350***                 1.001***                 0.705***        
                            (0.388)                  (0.146)                 (0.335)                  (0.104)         
                                                                                                                      
male_65_log                  -0.264                                           -0.259                                  
                            (0.310)                                          (0.291)                                  
                                                                                                                      
fem_ratio                    0.017                                            -0.005                                  
                            (0.013)                                          (0.013)                                  
                                                                                                                      
clust1                       -0.432                  -0.145                  0.568**                  0.677***        
                            (0.562)                  (0.689)                 (0.271)                  (0.245)         
                                                                                                                      
clust2                       -0.235                   0.219                  -0.452**                 -0.494**        
                            (0.457)                  (0.573)                 (0.225)                  (0.205)         
                                                                                                                      
clust3                       -0.167                   0.231                   0.104                    0.067          
                            (0.666)                  (0.621)                 (0.200)                  (0.184)         
                                                                                                                      
clust4                       0.047                   -0.102                                                           
                            (0.431)                  (0.602)                                                          
                                                                                                                      
Constant                    -4.026*                 -2.652**                 -3.311*                 -2.818***        
                            (2.162)                  (1.154)                 (1.792)                  (0.789)         
                                                                                                                      
----------------------------------------------------------------------------------------------------------------------
Observations                  938                      938                     388                      388           
R2                           0.192                    0.183                   0.433                    0.428          
Adjusted R2                  0.174                    0.175                   0.403                    0.409          
Residual Std. Error     3.288 (df = 917)        3.287 (df = 928)         1.366 (df = 368)         1.360 (df = 375)    
F Statistic         10.883*** (df = 20; 917) 23.028*** (df = 9; 928) 14.771*** (df = 19; 368) 23.336*** (df = 12; 375)
======================================================================================================================
Note:                                                                                      *p<0.1; **p<0.05; ***p<0.01

Comparing the results obtained for Warszawa and Kraków, we can see that the adjusted R2 is definitely higher for Kraków than for Warszawa. It should be also noted that Inpost comes from Kraków - more regular and more logical arrangement of machines. Division into clusters is definitely more appropriate for modeling for Kraków. What is really intersting, considering significant variables, more of them have the same impact for both cities (positive impact of number of Ruch and UPS points on number of Inpost points). As expected, total population - more inpost points and in turn, negative impact of parks (green areas) on number of inpost points.

When answering the hypotheses, it should be stated that all of them are positively considered. Inpost has parcel machines arranged in accordance with the competition (primarily Ruch and UPS). Control variables have a significant impact on the number of Inpost parcel machines. The use of GWR led to the verification of the above hypotheses and to the conclusion that spatial drift turned out to be significant.

9. Conclusions

Summarizing all the conclusions, we clearly state InPost parcel points are deployd according to the points of other operators. Considering the control variables, mostly their influence on the number of InPost points is significant. We also confirmed the significance of spatial effects in multivariate econometric models.